Abstract

Multisensor data fusion is presented in a rigorous mathematical format, with definitions
consistent with the desires of the data fusion community. In particular, a model
of event-state fusion is developed and described, concluding that there are two types of
models on which to base fusion (referred to in the literature as within fusion and across
fusion). Six different types of fusion are shown to exist, with respect to the model, using
category theory. Definitions of fusion rules and fusors are introduced, along with
the functor categories of which they are objects. Defining fusors and competing fusion
rules involves the use of an objective function of the researcher's choice. One such
objective function, a functional on families of classification systems and, in particular, on
receiver operating characteristics (ROCs), is introduced. Its use as an objective function is
demonstrated in that the argument which minimizes it (a particular ROC) corresponds to
the Bayes Optimal threshold, given certain assumptions, within a family of classification
systems. This is proven using a calculus of variations approach with ROC curves as a
constraint. This constraint is extended to ROC manifolds, in particular, topological
subspaces of $\mathbb{R}^n$. These optimal points can be found analytically if the closed form of the
ROC manifold is known, or calculated from the functional (as the minimizing argument)
when a finite number of points are available for comparison in a family of classification
systems. Under different data assumptions, the minimizing argument of the ROC functional
is shown to be the point of a ROC manifold corresponding to the Neyman-Pearson
criteria. A second functional, the norm, is shown to determine the min-max threshold.
Finally, more robust functionals can be developed from the offered functionals.
THE APPLICATION OF CATEGORY THEORY AND ANALYSIS
OF RECEIVER OPERATING CHARACTERISTICS TO
INFORMATION FUSION


I.  Introduction 

1.1  Problem  Statement 

Data fusion as a science has been developing rapidly since the 1980s. Fusion literature
encompasses many aspects of data fusion, from mathematical techniques [8, 15, 55]
to technologies, how to register and align data, and resource management of the
assets to be used. The Joint Directors of Laboratories Data Fusion Subpanel (JDL) has
put out guidance in the form of a functional model (which we will review later). What is
missing? A clear definition of what fusion is in a mathematical sense. While many
mathematical techniques have been developed and compiled, one look at the spread and variety
of sub-processes such as sensor fusion, data fusion, and classifier fusion (all of which can
be identified by other names) demonstrates the lack of unity within the science. As late
as 2001, the Handbook of Multisensor Data Fusion [15] includes a recommendation that
data fusion be defined as

the  process  of  combining  data  or  information  to  estimate  or  predict  entity 
states. 

This is an improvement over the Handbook’s previous version, but what are the
mathematical formulations for fusion? How shall we define the technology? For example, does
it matter how data or information are combined? What is meant by data or information?
Does the estimation or prediction of entity states need to conform to some standards of
accuracy or reliability to be called fusion? Are there clear delineations of different types
of fusion or is all fusion the same? How can we mathematically define and compare
differences? This dissertation will explore these questions, but will focus on the following
problem:

An entity (say some corporation) wants to combine some sets of constructed information
(or data) into a new set of symbols which clarifies the object from which the
information (or data) originated. The technology developed includes a finite number of
algorithms to compute the combinations. The entity has two problems it would like to
address:

1. In documenting its efforts, writing patent applications, conversing with the fusion
community, and contracting for technologies from other entities, it needs a common
framework (preferably quantitative in nature) to accomplish these tasks.

2.  How  does  the  entity  compete  the  algorithms  to  ensure  it  is  getting  the  most  for  its 
investment? 

In  particular,  we  envision  developing  a  rigorous  mathematical  lexicon  for  the  US  Air  Force 
to  use  in  creating  documents  contracting  for  fusion  technologies.  Although  the  definitions 
will  be  structured  from  abstract  mathematical  ideas,  the  vocabulary  will  be  rather  intuitive 
in  nature.  Furthermore,  we  present  one  concept  of  how  to  compete  fusion  technologies. 
Main  mathematical  results  are  identified  as  theorems,  lemmas,  and  corollaries. 

Since information fusion is a rapidly advancing science, researchers are daily adding
to the known repertoire of fusion techniques (that is, fusion rules); however, a methodology
to define what fusion is and when it has actually occurred has not been widely
discussed or identified in the literature. An organization that is building a fusion system
to detect or identify objects using existing assets or those yet to be constructed will
want to get the best possible result for the money expended. It is this goal which motivates
the need to construct a way to compete various fusion rules for acquisition purposes.
There are many different methods and strategies involved in developing classification
systems. Some rely on likelihood ratios, some on randomized techniques, and still others
on a myriad of schemes. To add to this, there exists the fusion of all these technologies,
which creates even more classification systems. Since receiver operating characteristic
(ROC) curves can be developed for each system under test conditions, we propose a
functional defined on ROC curves as a method of quantifying the performance of a
classification system. This functional then allows for the development of a cogent definition of
what fusion is, i.e., of the difference between fusion rules, which do not rely upon
any qualitative difference between the ‘new’ fused result and the ‘old’ non-fused result,
and what we term fusors (a subcategory of fusion rules), which do rely upon the qualitative
differences. While the development of some classification systems requires knowledge
of class conditional probability density functions, others do not. A testing organization
would not reveal the exact test scenario to those proposing different classification systems
prior to the test. Therefore, even those systems relying upon a priori knowledge of class
conditional densities can at best estimate the test scenario (and, by extension, the operational
conditions in which the system will later be used).

The functional we propose allows a researcher (or tester) who is competing classification
systems to evaluate their performance. Each system generates a ROC or a ROC
curve based on the test scenario. The desired scenario of the test organization may be
examined under a range of assumptions (without actually retesting), and functional averages
can be observed as well, so performance can be compared over a restricted range of
assumed cost functions and prior probabilities. The result is a sound mathematical approach
to comparing classification systems. The functional is scalable to any finite number of
classes (the classical detection problem being two classes), with the development of ROC
manifolds of dimension n > 3. The functional will operate on discrete ROC points in the
n-dimensional ROC space as well. Ultimately, we will be able, under certain assumptions
and constraints, to compete families of classification systems, fusion rules, fusors, and
fused families of classification systems in order to choose the best from among finitely
many competitors.
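
As a minimal sketch of how such a competition might look in the two-class case, the Python below scores discrete ROC points by the standard two-class Bayes cost and picks the cheapest operating point from each family. The cost formula, the function names (`bayes_cost`, `best_point`), and the sample ROC points are illustrative assumptions, not the functional developed in this dissertation.

```python
from typing import Dict, List, Tuple

RocPoint = Tuple[float, float]  # (false positive rate, true positive rate)


def bayes_cost(point: RocPoint, prior_pos: float,
               cost_fp: float, cost_fn: float) -> float:
    """Expected cost of one operating point (assumed standard two-class form,
    with zero cost for correct decisions)."""
    fpr, tpr = point
    prior_neg = 1.0 - prior_pos
    return prior_neg * cost_fp * fpr + prior_pos * cost_fn * (1.0 - tpr)


def best_point(points: List[RocPoint], prior_pos: float,
               cost_fp: float, cost_fn: float) -> Tuple[RocPoint, float]:
    """Return the minimum-cost ROC point of one family and its cost."""
    best = min(points, key=lambda p: bayes_cost(p, prior_pos, cost_fp, cost_fn))
    return best, bayes_cost(best, prior_pos, cost_fp, cost_fn)


# Hypothetical ROC points for two competing families of classification systems.
families: Dict[str, List[RocPoint]] = {
    "A": [(0.0, 0.0), (0.1, 0.7), (0.3, 0.8), (1.0, 1.0)],
    "B": [(0.0, 0.0), (0.2, 0.9), (0.5, 0.95), (1.0, 1.0)],
}

# One cost-prior assumption; trying other values requires no retesting,
# only re-evaluating the same ROC points.
results = {name: best_point(pts, prior_pos=0.5, cost_fp=1.0, cost_fn=1.0)
           for name, pts in families.items()}
winner = min(results, key=lambda name: results[name][1])
```

Changing `prior_pos`, `cost_fp`, or `cost_fn` can change both the selected operating point and the winning family, which is the sense in which the comparison depends on the cost-prior assumption.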

The relationships between ROCs, ROC curves, and performance have been studied for
some time, and some properties are well known. The foundations for two-class label sets
can be reviewed in [10, 14, 17, 30, 34, 36, 41, 45]. The method of discovery of these properties
is different from our own. Previously, the conditional class density functions were
assumed to be known, and differential calculus was applied to demonstrate certain
properties. For example, for likelihood-based classification systems, the fact that the slope of
a ROC curve at a point is the likelihood ratio which produces this point seems
to have been discovered in this manner [14]. Using cost functions in relation to ROC
curves to analyze best performance has recently (2001) been recognized by Provost and
Foster [42], based on work previously published in [17, 36, 48]. The main assumption
in most of the cited work, with regard to ROC curve properties, is that the distribution
functions of the conditional class densities are known and differentiable with respect to
the likelihood ratio (as a parameter). We take the approach that, as a beginning for the
theory, we have ROC manifolds that are continuous and differentiable, but we apply
variational calculus to a weighted distance functional on a specific family of manifolds, which
has the effect of identifying the point on the ROC manifold which minimizes Bayes Cost.
Under any particular assumption on prior probabilities and costs associated with errors in
classification, such a point exists for every family of classification systems. This is not
to say the classification system is Bayes Optimal with respect to all possible classification
systems, but rather it is Bayes Optimal with respect to the elements of the family of
classification systems producing the ROC manifold. We believe this functional (which is
really a family of functionals, one for each finite number of classes considered) eliminates the
need to discuss classification system performance in terms of area under the ROC curve
(AUC), which is so prevalently used in the medical community, or volume under the ROC
surface (VUS) [12, 37], since these performance ‘metrics’ do nothing to describe a
classification system’s value under a specific cost-prior assumption. Any classification system
used will be set at a particular threshold (at any one time), and so its performance will
be measured by only one point on the ROC curve. The question is “What threshold will
the user choose?” We submit that this performance can be calculated very quickly under
the test conditions desired (using ROC manifolds) by applying vector space methods to
the knowledge revealed by the calculus of variations approach. Additionally, the novelty
of this proposal relies on the fact that no class conditional densities are assumed (by
the tester), and that the parameters of the functional can be chosen to reflect the desired
operational assumptions of interest to the tester. For example, the tester could establish
that Neyman-Pearson criteria will form the data of the functional or, when minimizing
a Bayes cost functional, may wish to examine performance under a range of
hypotheses. Once the data are established, the functional will induce a partial ordering
on the category of fusion rules, fusors, and ultimately the set of families of classification
systems. This partial ordering is a category in itself, but is also used to provide a
mathematical definition of a fusor, which is derived from the fusion rules and embodies
mathematically the qualitativeness desired by researchers according to the application of
the problem in which they are engaged. In other words, we have put to paper the definition
of what makes a fusion rule based classification system “better” than the classification
systems from which it was derived. An illustrative example and further applications of
the functionals, with consideration of robustness, are put forth in the final section of this
dissertation.

1.2  Literature  Review 

Our literature review consisted of three main areas: information or data fusion,
category theory with data fusion, and ROC analysis. We were interested in how other
researchers discussed and communicated their ideas of fusion, and in particular, whether
mathematical descriptions of the overall fusion process are used (and not just a particular
technique). Our decision to use category theory as the mathematical language prompted a
search for the application of category theory to the science of information fusion. Finally,
how do researchers ensure their results have the quality required to actually call what they
are doing fusion? We decided to explore the world of ROC analysis since every
classification system can generate at least one ROC, and this seemed a reasonable place to look
for the type of functions (or functionals) which would be useful to provide a definition for
quality of a particular fusion rule.

1.2.1 Data Fusion. As late as 1999, Dr. Wald in [54] described the challenges in
the science of data fusion posed by not having a language with common terms. These
challenges are readily seen in the early results of the JDL definitions, where the language
of what fusion was consisted of combining, integrating, estimating, predicting, scheduling,
optimizing, and more! The earlier Handbook of Data Fusion [15] had this definition of
fusion (from the JDL Data Fusion Lexicon):

A  process  dealing  with  the  association,  correlation,  and  combination  of  data 
and  information  from  single  and  multiple  sources  to  achieve  refined  position 
and  identity  estimates,  and  complete  and  timely  assessments  of  situations  and 
threats, and their significance. The process is characterized by continuous
refinements of its estimates and assessments, and the evaluation of the need for
additional  sources,  or  modification  of  the  process  itself,  to  achieve  improved 
results. 

This  definition  was  pruned  in  [15]  to  be: 

Data  fusion  is  the  process  of  combining  data  or  information  to  estimate  or 
predict  entity  states. 

Dr.  Wald  correctly  identified  some  of  the  problems  and  expressed  the  desire  to  have  a 
more  suitable  definition  with  the  following  principles: 

•  The  definition  should  not  be  restricted  to  data  output  from  sensors  alone; 

•  It  should  not  be  based  on  the  semantic  levels  of  the  information; 

•  It  should  not  be  restricted  to  methods  and  techniques; 

•  It  should  not  be  restricted  to  particular  system  architectures. 

He then went on to offer a definition: “data fusion is a formal framework in which are
expressed means and tools for the alliance of data originating from different sources. It
aims at obtaining information of greater quality; the exact definition of ‘greater quality’
will depend upon the application.”

Here we have two definitions that are very close but still at odds. The first does
not require a formal framework, as the second does; it states a purpose for the fusion
but places no explicit requirement on the quality of the information.
The second requires tools for the alliance of data from different sources (without defining
what is different about them), states the purpose much better, and allows the quality of the
improvements to rest with the body of research. This ‘greater quality’ is still not defined.

While these two works focus on definitions, the vast majority of other data fusion
papers and books focus on the use of particular mathematical techniques. Each author
shows the cases in which his technique is optimal (see for example [6, 23, 24]), and
compares against a single parameter, such as probability of detection, or uses a ROC curve. In
those cases where ROC curves can be shown to be dominant in the compared technique,
the fusion rule is proven, but in cases where ROC curves cross this comparison is not
possible without further elaboration and theory development. Performance evaluation is
also a concern in [33], where the use of information measures of effectiveness (MOEs) is
discussed. The focus there is on multisource-multitarget statistics, referred to as FISST
(finite set statistics). Information theory measurements are used, such as the
Kullback-Leibler cross entropy or discrimination. These measures seem to
pertain only to the signal level of the classification system. In particular, the
Kullback-Leibler discrimination uses the probability distribution associated with ground truth and
the random variable representing a sensor system. Since we will show the classification
system is a random variable made up of sensors, processors, and classifiers, the information
theory approach is useful for the development of better sensors (and possibly processors).
The drawback is that it does not respect Bayesian principles, in that it does not allow
for testing of different prior probabilities and costs. In the cases in which these measures seem
to be a good measurement, the label sets are simply the two-class case of classification systems.
Extending the distribution functions to a joint distribution function of k classes will prove
to be very cumbersome to the researcher. We admit the connections between the
information theory measurements and the classification system measurements of probabilities
of error need to be formally explored, but this is beyond the scope of this dissertation.

1.2.2 Category Theory and Fusion. Literature in the area of category theory
and fusion is very limited. Only a few authors have attempted to use
the mathematics in this sense [7, 25-27]. Each of these works relies upon the use of
formal systems, or systems constructed from first-order predicate logic. These are systems a
computer can understand through the writing of software. These constructions require
that theories can be written, using symbolic logic, which completely describe the target
classes of interest. Also required are models of the environment, which incorporate these
theories. The categories are actually categories whose objects are specifications (from the
computer language Slang® by Specware). Each specification consists of a collection of
pairs of theories and signatures (languages). The arrows of the category are mappings
(not in the sense of functions) changing one specification into another, so that identities
are clearly defined. In these papers fusion is an “operator” which returns the colimit of
the objects. This turns out to be the disjoint union of the theories and languages. The
operation is subject to a constraint that maintains the consistency of the category. We will
show, as an example, after our development of a fusion definition, how this construction
fits into our view.

Figure 1.1: The Joint Directors of Laboratories Functional Model of Data Fusion [15].
Another set of interesting papers applying category theory has been written by
Dr. M. J. Healy [19-22] of Boeing. Healy puts forth the notion of a category Neur of
neural networks. The objects of this category are the nodes of the neural net, and the
arrows are the primed paths (the identity arrows being clear). Composition is the usual
composition of arrows, so that if one path is primed and a second is primed from the range
of the first, then there is a primed path from the domain of the first to the range of the
second. He then asserts in [19] that memories are the colimits of primed paths. Colimits,
functors, and natural transformations in a different category show the shortcomings of
adaptive resonance theory (ART) networks [22]. Colimits again play a pivotal role in [20],
which expands the previous work by introducing a new category Cone of concepts, which is
like Kokar’s work in that it relies upon theories and predicate logic, and by defining functors
between Cone and Neur.

With all these works pointing towards creating categories which then depend on
colimits as their fusion, is it then true that colimits are the definition of fusion we’re looking
for? We don’t think so based on the following:

•  while colimits are optimal in an algebraic sense, there are still classification
parameters to be considered. For example, just because a colimit can be calculated doesn’t
mean the new classification is correct! There is the possibility that error in
the original data has skewed the colimit into producing something which performs
worse than one of the systems we started with, thus ignoring the desired qualitative
aspect of fusion we’re looking for.

•  the colimits developed were specific to formal methods used in designing computer
systems. They are not applicable to other systems designed in different ways.
According to Dr. Wald, then, this requires a particular system architecture; therefore,
we cannot define fusion based on these cases alone.

1.2.3 ROC Analysis. Receiver Operating Characteristics (ROCs) play a significant
role in determining the performance of classification systems. They have been used
extensively in the medical and psychological communities with regard to imaging and
diagnoses [14, 48]. The definitions of ROCs and ROC curves and manifolds are presented
in Section 2.2. There are in general two ways to look at the analysis of ROCs. The first
is to consider an entire family of classification systems which create a ROC curve or
manifold. The second is to consider that each classification system creates a ROC, that there
are particular families of these classifications which can be constructed with meaning, and
that there is a Bayesian interpretation of their significance with respect to the problem of
classification.

We  explore  the  first  viewpoint  by  noting  that  in  two-class  problems,  the  ROC  curve 
is  an  entity  in  two  space,  the  basis  of  which  is  two  error  axes.  Thus,  the  ROC  curve  can 
be  used  to  calculate  certain  statistical  properties  of  the  original  family  of  classification 
systems.  One  such  measurement  is  the  area  under  the  ROC  curve  (AUC).  The  area  under 
the  ROC  curve  has  been  described  in  a  couple  of  different  ways: 

•  Given  two  instances  of  data,  one  from  each  of  the  two  populations,  the  AUC  is  the 
probability  of  the  system  correctly  identifying  the  class  of  each  instance  of  data 
[14]. 

•  The more general view is that AUC is a measure of how well a family of classification
systems separates the conditional class distribution functions of the two classes.
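
The first description can be illustrated with a small sketch. Under the common reading that the AUC equals the probability that a randomly drawn instance of the target class receives a higher score than a randomly drawn instance of the non-target class (the Mann-Whitney statistic), the Python below compares that pairwise estimate with the trapezoidal area under the empirical ROC curve. The scores, the convention that a higher score means "target", and the function names are assumptions made for illustration.

```python
from typing import List, Tuple


def auc_pairwise(pos: List[float], neg: List[float]) -> float:
    """Fraction of (pos, neg) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_points(pos: List[float], neg: List[float]) -> List[Tuple[float, float]]:
    """Empirical ROC: one (FPR, TPR) point per threshold t, labeling an
    instance 'target' when its score is >= t."""
    thresholds = sorted(set(pos) | set(neg), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        pts.append((fpr, tpr))
    return pts


def auc_trapezoid(pts: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores for the two populations.
pos_scores = [0.9, 0.8, 0.6, 0.4]
neg_scores = [0.7, 0.5, 0.3, 0.2]

auc_a = auc_pairwise(pos_scores, neg_scores)
auc_b = auc_trapezoid(roc_points(pos_scores, neg_scores))
```

Note that every threshold in `roc_points` yields a different classifier, so the curve summarizes a family of systems, not a single one.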

We will point out that the emphasis on the family of classification systems is ours.
Generally, researchers have regarded these curves as being derived from a single classification
system. This is an incorrect view of the problem of ROC analysis. It is recognized
that to generate a curve or a manifold, a parameter (which is possibly multi-dimensional)
must be varied. This changes the classification system, so that it does not have the same
performance as the original one.

The AUC is, in general, the measurement sought after by many researchers, and
researchers have gone out of their way to estimate it by many means. These means include
calculating the ROC convex hull (ROCCH) [41, 42], the Mann-Whitney test, and the Gini
coefficient [16]. These efforts are based on the belief that the AUC divorces the problem
from finding out the costs involved in making the errors the ROC measures and from
knowing the relative ratios of the classes in question. In other words, by using the AUC and
the associated estimates, one does not need to concern oneself with prior probabilities and
the costs of making certain errors in classification. We show in Section 4.3 that this is
not the case. When one believes the AUC or any other measure based on ROCs (such as
the Neyman-Pearson criterion) has divorced the problem from assuming particular costs
and/or prior probabilities, one is deceived.

The second viewpoint is present in the works of [2, 3, 41, 42]. This viewpoint provides
a way of working with prior probabilities and costs. It is the more valid viewpoint in our
opinion, based on our theoretical developments. This is due to the fact that the problem of
optimizing the ability to discriminate between multiple classes is an optimization problem
with assumptions and constraints, not just a statistical problem. One cannot divorce
the problem from the inherent prior probabilities and costs, precisely because any
criterion established for making the discriminations has attached to it an underlying
cost-prior ratio, which is now simply hidden; one cannot escape facing the
costs and the prior probabilities of the problem. This is most clearly laid out in [40].

The problem is certainly expanded when one considers multiple class problems
(problems where the classes in question number greater than two). A few papers have been
written concerning this. In [16], the first viewpoint is used, and statistical estimates are
developed to compare families of classification systems. In our view, this will lead to the
selection of classification systems that are suboptimal for problems where some
knowledge regarding costs and prior probabilities exists. Also, we do not explore the inherently
statistical nature of the work.

In [37], three classes, all mutually exclusive, are analyzed, and Volume Under the
Curve (VUC) is explored as a measure of how well a family of classification systems
performs. We have the same criticisms regarding the significance of this measure; however,
the paper goes further, at least, in describing the geometry on which such a construct
relies. There is no discrimination between the types of errors committed, since the axes
developed are each based on correct identifications, and not incorrect identifications. We
show later that for k classes there are $k^2 - k$ error axes required for a full ROC manifold
development and that only if errors within types have identical costs associated with them
can we project the problem into k dimensions (so that for 3 classes one needs 6 dimensions
for a full ROC manifold, but could project into 3 dimensions only if the errors within
classes have identical costs).

The authors of [41,42] show the greatest amount of promise in the field, by focusing on
the optimization of costs. It is well known that if one considers a ROC curve as a function,
with the independent variable being the false positive rate and the dependent variable the
true positive rate, then under a mild assumption of smoothness, the ROC curve is
differentiable, and one can show that to minimize the Bayes cost function with two classes,
one needs to find the point on the ROC curve whose derivative equals a particular cost-prior
ratio. The only paper we found with a “proof” of this was [36], whose author claims the
result can be shown. That analysis, however, fails to prove that the critical points achieved
are always minima. In fact, since the second derivative test is inconclusive, one must use the
first derivative test to prove the minimum, and the first derivative test is not available to
us in the case of multivariate problems. We use calculus of variations and the global
optimization theory of vector space methods to prove this result not just for ROC curves but
also for ROC manifolds, so that problems of multiple classes can be analyzed using the
Bayesian methods. Our method involves only the geometry of the ROC manifold in ROC space,
along with the same cost function (examined as the functional $J$ in Chapter IV).
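
The slope condition described above can be checked numerically. The following is a minimal sketch, not the method of this dissertation: it assumes two unit-variance normal class densities separated by a hypothetical distance `d`, and the names `prior_pos`, `c_fp` and `c_fn` (prior of the positive class and the two error costs) are illustrative assumptions. It compares a brute-force minimizer of the Bayes cost with the threshold at which the ROC slope equals the cost-prior ratio.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def roc_point(t, d):
    """(false positive rate, true positive rate) when 'positive' is declared for x > t."""
    fpr = 1.0 - norm_cdf(t)        # negatives assumed N(0, 1)
    tpr = 1.0 - norm_cdf(t - d)    # positives assumed N(d, 1)
    return fpr, tpr

def roc_slope(t, d):
    """d(tpr)/d(fpr) at threshold t, i.e. the ratio of the two class densities."""
    return math.exp(d * t - 0.5 * d * d)

def bayes_cost(t, d, prior_pos, c_fp, c_fn):
    """Expected cost of the threshold rule (correct decisions cost nothing)."""
    fpr, tpr = roc_point(t, d)
    return c_fp * (1.0 - prior_pos) * fpr + c_fn * prior_pos * (1.0 - tpr)

def grid_minimizer(d, prior_pos, c_fp, c_fn, lo=-6.0, hi=8.0, steps=14000):
    """Brute-force search for the cost-minimizing threshold on a uniform grid."""
    ts = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(ts, key=lambda t: bayes_cost(t, d, prior_pos, c_fp, c_fn))

# Hypothetical problem data.
d, prior_pos, c_fp, c_fn = 1.5, 0.3, 1.0, 2.0
cost_prior_ratio = (c_fp * (1.0 - prior_pos)) / (c_fn * prior_pos)

# Threshold at which the ROC slope equals the cost-prior ratio.
t_star = (math.log(cost_prior_ratio) + 0.5 * d * d) / d
t_grid = grid_minimizer(d, prior_pos, c_fp, c_fn)
```

Under these assumptions `t_grid` agrees with `t_star` to within the grid spacing, which is consistent with the derivative condition but is of course no substitute for a proof.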

Much use is made in [41,42] of the ROC convex hull (ROCCH). The usefulness of the ROCCH is
apparent when one considers creating randomized decision rules from previously created
families of classification systems. When two classification systems overlap, one can
consider the ROCCH as a solution to which classification system to choose, since the ROCCH
can be created under a convex combination of selecting probabilistically one family or the
other. We show, however, that with respect to a particular cost-prior assumption, no
benefit is gained by doing this, since under these assumptions no greater cost benefit can be
achieved along the extension of the convex hull than there is at the endpoints (which
already exist). Under the first viewpoint (randomized decision rules), there is a benefit:
the direct increase in the area under the curve, so that by randomly selecting the family
from which to choose, one may increase the overall ability to separate the conditional class
distribution functions associated with the classification families. Similarly, in [12], the
search is on to construct the convex polytope associated with three class (and n class)
problems. The authors use the “trivial” classification systems to construct the best estimate
of the convex polytope, but they also recognize the efforts of [37] and [16] in their approach.
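
To make the two viewpoints concrete, here is a small sketch under stated assumptions: ROC points are given as hypothetical (false positive rate, true positive rate) pairs, the hull is computed with a standard monotone-chain scan, and the cost uses illustrative names `prior_pos`, `c_fp` and `c_fn`. Because the cost is linear in the ROC coordinates, a randomized mixture of two classifiers never costs less than the cheaper of the two.

```python
def cross(o, a, b):
    """Signed area test: negative when o -> a -> b turns clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_convex_hull(points):
    """Upper convex hull of ROC points, including the two trivial classifiers."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last vertex while it lies on or below the chord to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

def mix(p, q, lam):
    """ROC point of the randomized rule using p with probability lam, else q."""
    return (lam * p[0] + (1 - lam) * q[0], lam * p[1] + (1 - lam) * q[1])

def cost(point, prior_pos, c_fp, c_fn):
    """Linear Bayes cost of a ROC point (correct decisions cost nothing)."""
    fpr, tpr = point
    return c_fp * (1 - prior_pos) * fpr + c_fn * prior_pos * (1 - tpr)

points = [(0.1, 0.5), (0.3, 0.6), (0.4, 0.9), (0.7, 0.8)]  # hypothetical classifiers
hull = roc_convex_hull(points)
```

For these sample points the hull keeps (0.1, 0.5) and (0.4, 0.9) and discards the two dominated classifiers; evaluating `cost(mix(p, q, lam), ...)` along any hull edge stays at or above the smaller endpoint cost.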

ROCs also have an inherent application to detection problems involving electronics
(thus the “receiver” in receiver operating characteristics). This history and analysis can
be found in textbooks, particularly in [10, 30, 34, 45]. The emphasis in these texts is
on the development of classification systems and not on performance evaluation
from ROCs alone. In some respects the developments in these texts overlap with our
development, but from the opposite approach. There are also some errors in the
texts, which are not apparent until one works through the analysis with respect
to risk sets in detail (particularly the min-max example in [45]). We need to point out two
things with respect to this. First, our optimization is significantly different in its
characterization of the problem. We use calculus of variations and properties of linear
transformations to establish our optimization problem, and we pick up on some differences
that we feel are very important. Second, there is no connection to information fusion or
category theory given in these texts, so our application is new and independent and extends
the field of knowledge. Overall, we believe our review to be sufficient to look into the
use of ROCs, ROC curves, and ROC manifolds in order to produce a theory which is
satisfactory for discriminating the performance of one classification system over another,
or one family of classification systems over another. If an objective function can be
produced on ROCs, ROC curves, and ROC manifolds, then we can define the qualitative
nature of fusion according to each application (that is, which fusion rules are better than
the original classification systems, and which fusion rules are superior to others).




II.  Background 

2.1  Mathematical  Formalisms  and  Definitions 

This work is inherently an applied math dissertation. As such, the background material
required in order to understand it is drawn from the areas of Topology, Category
Theory, Probability Theory (measure-theoretic in scope), and some vector space knowledge.
We assume the reader has a basic knowledge of vector spaces, but
certain useful definitions and theorems are stated concisely in this section to facilitate the
reader's understanding. Definitions of receiver operating characteristics (ROCs), ROC
curves, and ROC manifolds are also given, along with a couple of convergence theorems
useful for understanding why ROC analysis has drawn such attention from
researchers.

2.1.1  Topology  Formalisms  and  Definitions. 

Definition 1 (Preimage). Let $f$ be a function with $X$ the domain of $f$ and $Y$ the range.
Then given $B \subset Y$, we denote the preimage of $B$ over $f$ by $f^{\natural}(B)$, where

$$f^{\natural}(B) = \{x \in X : f(x) \in B \subset Y\}. \qquad (2.1)$$

The symbol $\natural$ is the natural symbol from music literature (also known as the becuadro) and
is used precisely because we do not want to confuse the preimage of a set over a function
with the inverse of the function, which is denoted as $f^{-1}$.
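
For a finite domain, the preimage of Definition 1 can be computed directly. The sketch below is illustrative only (the function name `preimage` and the sample data are assumptions); it also shows why the preimage is kept distinct from the inverse, since squaring has no inverse on the chosen domain but every subset of the range still has a preimage.

```python
def preimage(f, domain, B):
    """Return the set {x in domain : f(x) in B}."""
    return {x for x in domain if f(x) in B}

X = {-2, -1, 0, 1, 2}

def square(x):
    """A function on X that is not one-to-one, so it has no inverse."""
    return x * x

pre = preimage(square, X, {1, 4})   # every x whose square is 1 or 4
```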

Definition 2 (Topology, Topological Space [38]). A topology $\tau$ on a set $X$ is a collection
of subsets of $X$ such that:

i. $X, \emptyset \in \tau$.

ii. Arbitrary collections of sets of $\tau$ have their unions in $\tau$.

iii. Finite collections of sets of $\tau$ have their intersections in $\tau$.

The sets contained in $\tau$ are called the open sets of $X$. We say $(X, \tau)$ is a topological
space.
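
On a finite set the three conditions of Definition 2 can be verified exhaustively. The following sketch is an illustration under that finiteness assumption (the name `is_topology` is hypothetical): when the collection is finite, closure under pairwise unions and intersections already gives closure under arbitrary unions and finite intersections.

```python
from itertools import combinations

def is_topology(X, tau):
    """Check the topology axioms for a finite collection tau of subsets of X."""
    X = frozenset(X)
    tau = {frozenset(s) for s in tau}
    if any(not s <= X for s in tau):          # every member must be a subset of X
        return False
    if X not in tau or frozenset() not in tau:
        return False
    for U, V in combinations(tau, 2):
        if U | V not in tau or U & V not in tau:
            return False
    return True

X = {1, 2, 3}
nested = [set(), {1}, {1, 2}, {1, 2, 3}]
not_closed = [set(), {1}, {2}, {1, 2, 3}]      # the union {1, 2} is missing
```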

Example 1. Here is an example from [38]. Given a set $X$, with an order relation $<$, and
$a, b \in X$, the following types of sets are in the topology:

1. $(a, b) = \{x \mid a < x < b\}$;

2. $[a_0, b) = \{x \mid a_0 \le x < b\}$, where $a_0$ is the smallest element of $X$ (if one exists);

3. $(a, b_0] = \{x \mid a < x \le b_0\}$, where $b_0$ is the largest element of $X$ (if one exists).

The collection $\mathscr{B}$ of such sets for all $a, b \in X$ is a basis for the order topology on $X$.

Definition 3 (Hausdorff Space). A topological space $(X, \tau)$ is a Hausdorff space if for
any two elements $x_1, x_2 \in X$ with $x_1 \neq x_2$, there exist open sets $U, V \in \tau$ such that
$x_1 \in U$ and $x_2 \in V$, with $U \cap V = \emptyset$.

Example 2. The set of real numbers, $\mathbb{R}$, with the Euclidean metric of distance between
two points, is an example of a Hausdorff space.

Definition 4 (Metric, Metric Space). Let $X$ be a set. Then for $x, y, z \in X$, if there exists
a function $d$, such that $d : X \times X \to \mathbb{R}^{+}$, which satisfies the conditions:

i. $d(x, y) \ge 0$ (non-negativity);

ii. $d(x, y) = 0$ iff $x = y$ (positive definiteness);

iii. $d(x, y) = d(y, x)$ (symmetry);

iv. $d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality);

then $d$ is a metric. We call $(X, d)$ a metric space, though the notation is often suppressed
to simply $X$.

Example 3. The Euclidean metric of distance, given $x, y \in X$, $d(x, y) = |x - y|$ is a
metric. Every Euclidean metric induces a topology as well, since open sets can be defined
in terms of Euclidean distance, and a basis for such topologies can easily be formed using
open balls in $X$.
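
The four conditions of Definition 4 can be spot-checked on a finite sample of points. The sketch below is only a necessary test (passing it on a sample does not prove that a function is a metric), and the names `satisfies_metric_axioms`, `euclid` and `squared` are assumptions made for the illustration. The squared difference is included because it fails the triangle inequality.

```python
from itertools import product

def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check the four metric conditions on every pair/triple drawn from points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                       # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):        # positive definiteness
            return False
        if abs(d(x, y) - d(y, x)) > tol:      # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # triangle inequality
            return False
    return True

def euclid(x, y):
    return abs(x - y)

def squared(x, y):
    return (x - y) ** 2

sample = [0.0, 1.0, 2.0, -3.5]
```

With this sample, `squared` is rejected because the distance from 0 to 2 is 4, which exceeds the sum 1 + 1 of the distances through the intermediate point 1.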
Definition 5 (Open Ball in $\mathbb{R}^n$). An open ball in $\mathbb{R}^n$ relative to a metric $d$ is written $B(x; r)$
where there is no misunderstanding of the metric. The meaning of the open ball is

$$B(x; r) = \{y \in \mathbb{R}^n \mid d(x, y) < r\},$$

where $x \in \mathbb{R}^n$ is the center of the ball with $r$ its radius.

Example 4. Let $(\mathbb{R}, d)$ be a metric space. Then for $\varepsilon > 0$,

$$B(x; \varepsilon) = \{y : d(x, y) < \varepsilon\}$$

forms an open ball in $\mathbb{R}$.

Definition 6 (m-Manifold [38]). Let $m \in \mathbb{N}$ be given. A topological space $(X, \tau)$ is an
m-manifold if it is a Hausdorff space and has a countable basis such that each neighborhood
of a point $x \in X$ is homeomorphic with an open subset in $\mathbb{R}^m$.

Example 5. In $\mathbb{R}^n$, $n \ge 3$, a 1-manifold is a curve, a 2-manifold is a surface, etc.

2.1.2  Probability  Theory  Formalisms  and  Definitions.  Necessary  to  reading  this 
dissertation  is  a  common  frame  of  reference  with  regard  to  category  theory  and  probability 
theory.  We  will  start  with  the  latter  and  the  reader  can  always  familiarize  himself  with  [5] 
for  probability  theory,  or  [43,44]  for  elementary  measure  theory. 

Definition 7 (Algebra or Field of Sets). Let $X$ be an arbitrary set. A collection $\mathfrak{B}$ of
subsets of $X$ is an algebra or a field if it satisfies three properties:

i. $X \in \mathfrak{B}$.

ii. For any $B \in \mathfrak{B}$, we have the set complement, $X \setminus B$, written $B^{c}$, also in $\mathfrak{B}$.

iii. Given the finite collection $\{B_i \in \mathfrak{B} : i = 1, 2, \ldots, n \in \mathbb{N}\}$, then $\bigcup_{i=1}^{n} B_i \in \mathfrak{B}$.

Definition 8 ($\sigma$-algebra or $\sigma$-field). Let $X$ be an arbitrary set. A collection of subsets,
$\mathfrak{B}$, of $X$ is a $\sigma$-algebra, or $\sigma$-field, on $X$ if $\mathfrak{B}$ satisfies three properties:

i. $X \in \mathfrak{B}$.

ii. For any $B \in \mathfrak{B}$, we have $B^{c}$ is also in $\mathfrak{B}$.

iii. Given the countably infinite collection $\{B_i \in \mathfrak{B} : i = 1, 2, \ldots\}$, then $\bigcup_{i=1}^{\infty} B_i \in \mathfrak{B}$.

We can see that a $\sigma$-field is a field of sets as well.

Example 6. Given a set $X$, the power set of $X$, $\mathscr{P}(X)$, is a $\sigma$-field.
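
On a finite set every field of sets is also a $\sigma$-field, so the smallest $\sigma$-field containing a few generating sets can be built by closing under complements and unions. The sketch below is an illustration under that finiteness assumption; the name `generate_sigma_field` is hypothetical and the procedure is a plain fixed-point iteration rather than an efficient algorithm.

```python
def generate_sigma_field(X, generators):
    """Smallest collection containing the generators that is closed under
    complement and union (finite X only)."""
    X = frozenset(X)
    field = {frozenset(), X} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for A in current:
            candidates = [X - A] + [A | B for B in current]
            for C in candidates:
                if C not in field:
                    field.add(C)
                    changed = True
    return field

X = {1, 2, 3, 4}
small = generate_sigma_field(X, [{1}])
larger = generate_sigma_field(X, [{1}, {2}])
```

Here `small` contains exactly the empty set, {1}, {2, 3, 4} and X, while `larger` has the eight sets generated by the atoms {1}, {2} and {3, 4}; the power set of Example 6 would be recovered by using all singletons as generators.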

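Definition 9 (Positive Measure). Let $X$ be a set and $\mathfrak{B}$ be a $\sigma$-field over $X$, then any
set function $\nu$ defined on $\mathfrak{B}$ with range $\mathbb{R}[0, \infty]$ is called a positive measure on $X$ if it is
countably additive. That is, given a disjoint, countable collection, $\{B_i\}_{i=1}^{\infty}$, of sets in $\mathfrak{B}$,
then

$$\nu\Bigl(\bigcup_{i=1}^{\infty} B_i\Bigr) = \sum_{i=1}^{\infty} \nu(B_i).$$

Example 7. See [43] for the definition of outer measure on $\mathbb{R}$. Outer measure returns
lengths of intervals on $\mathbb{R}$, so that outer measure, restricted to the Lebesgue measurable
sets, is a positive measure on $\mathbb{R}$.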

Definition 10 (Sample Space). Given a complex $T$ of conditions, which allows any number
of repetitions (an experiment, for example), there is a collection of elementary events,
$\xi, \zeta, \ldots$, not necessarily countable, which is called the sample space, and will be
denoted as $\Omega$ [28].

Example 8. An example of a complex of conditions $T$ is “the tossing of a coin”, while
the sample space $\Omega = \{h, t\}$, where $h$ is the event of getting a “head” as a result, and $t$ the
event of getting a “tail”. If $T$ is “the tossing of a coin two times”, then
$\Omega = \{hh, ht, th, tt\}$ is the sample space. If the event $T$ implies that a tail results, then
any of the last three elementary events of $\Omega$ has occurred. The complex of conditions will
usually be described using language in order to identify meaning. This language then
leads to the formation of the sample space and the $\sigma$-field so that a probability measure
can be defined. All probabilistic mathematical language for the given problem flows from
this beginning.
A $\sigma$-field, $\mathfrak{B}$, can be developed on $\Omega$, so that the pair $(\Omega, \mathfrak{B})$ is a measurable space.
Given a positive measure $\mu$ on $(\Omega, \mathfrak{B})$, the triple $(\Omega, \mathfrak{B}, \mu)$ is called a measure space.

Definition 11 (Measurable Function). Let $(X, \mathfrak{B})$ be a measurable space and $(Y, \tau)$ be
a topological space. A function $f : X \to Y$ is called measurable if for each $E \in \tau$, we have
that the preimage of $E$ under $f$, denoted $f^{\natural}(E)$, is also an element of $\mathfrak{B}$. We call $f$ a
$\mathfrak{B}$-measurable function.

Example 9. Let $\mu$ be Lebesgue measure on $\mathbb{R}$. Consider any continuous function $f$ with
compact support. Since the preimage of an open set is open for continuous functions,
and open sets are always contained in the Borel $\sigma$-field, we have that these functions are
measurable.

Definition 12 (Finite, $\sigma$-Finite Measures). If $\mu(\Omega) < \infty$, then $\mu$ is a finite measure.
A measure $\mu$ is $\sigma$-finite if there exists a sequence $\{B_n\}$ of elements of $\mathfrak{B}$ such that
$\Omega = \bigcup_{n=1}^{\infty} B_n$ and $\mu(B_n) < \infty$ for each $n \in \mathbb{N}$. Finite measures are clearly $\sigma$-finite as well.

Example 10. Lebesgue measure $\mu$ on $\mathbb{R}$ is an example of a $\sigma$-finite measure. Consider
the countable balls with radius $\varepsilon > 0$ and centers $x \in \mathbb{Q}$. The union of these balls is $\mathbb{R}$,
while the measure of each ball is finite.

Definition 13 (Probability Measure, Probability Space [49]). Given a measurable space
$(\Omega, \mathfrak{B})$, a positive measure $\mu$ with $\mu(\Omega) = 1$ is defined as a probability measure. A
probability measure is a finite measure and therefore a $\sigma$-finite measure as well. The measure
space $(\Omega, \mathfrak{B}, \mu)$ is called a probability space.

Notice how the properties of the definitions flow:

Since $\Omega \in \mathfrak{B}$, then $\emptyset \in \mathfrak{B}$. Therefore, since $\mu(\Omega) = \mu(\Omega \cup \emptyset)$, we have
by the definition of a positive measure the property of countable additivity, so
that $\mu(\Omega) = \mu(\Omega) + \mu(\emptyset)$. Since $\mu(\Omega) = 1$, we have that

$$1 = 1 + \mu(\emptyset),$$

so that $\mu(\emptyset) = 0$. Furthermore, for any $B \in \mathfrak{B}$, we have $0 \le \mu(B) \le 1$, and
since $\mu(\Omega) = \mu(B^{c}) + \mu(B)$ we have that

$$1 - \mu(B) = \mu(B^{c}).$$




Definition 14 (Bayes Theorem). Given a probability space $(\Omega, \mathfrak{B}, \mu)$, and $B_1, B_2 \in \mathfrak{B}$
with $\mu(B_2) > 0$, the conditional probability of $B_1$ given $B_2$ is written

$$P(B_1 \mid B_2) = \frac{\mu(B_1 \cap B_2)}{\mu(B_2)}.$$

With this in mind, Bayes Theorem states

$$P(B_1 \mid B_2)\,\mu(B_2) = P(B_2 \mid B_1)\,\mu(B_1), \qquad (2.2)$$

so that

$$P(B_1 \mid B_2) = \frac{\mu(B_1)}{\mu(B_2)}\,P(B_2 \mid B_1). \qquad (2.3)$$

The notation is written to emphasize that $P$ is not a measure, but rather given an event $B_2$,
then $P(\cdot \mid B_2)$ is a measure which is related to the measure $\mu$ by the definition. We refer to
the left-hand side of the equation as the conditional probability of the event $B_1$ given event
$B_2$ has occurred, while the conditional probability on the right-hand side of the equation
is referred to as the posterior probability. Each real number $\mu(B_1)$ and $\mu(B_2)$ is a prior
probability.
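
Equations 2.2 and 2.3 can be verified exactly on the two-toss sample space of Example 8. The sketch below assumes a fair coin (a uniform measure, which the text does not specify) and uses exact rational arithmetic; the event names are hypothetical.

```python
from fractions import Fraction

omega = {"hh", "ht", "th", "tt"}

def mu(event):
    """Uniform probability measure on the two-toss sample space (assumed fair coin)."""
    return Fraction(len(event & omega), len(omega))

def cond(b1, b2):
    """P(b1 | b2) = mu(b1 ∩ b2) / mu(b2); requires mu(b2) > 0."""
    if mu(b2) == 0:
        raise ValueError("conditioning event has measure zero")
    return mu(b1 & b2) / mu(b2)

first_is_head = {"hh", "ht"}
at_least_one_tail = {"ht", "th", "tt"}

lhs = cond(first_is_head, at_least_one_tail) * mu(at_least_one_tail)   # Eq. 2.2, left
rhs = cond(at_least_one_tail, first_is_head) * mu(first_is_head)       # Eq. 2.2, right
```

Both sides of Equation 2.2 equal the measure of the intersection, here 1/4, and dividing by the measure of the conditioning event gives Equation 2.3.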

Definition 15 (Random Variable). Given a probability space $(\Omega, \mathfrak{B}, \mu)$ and a measurable
space $(\Omega', \mathfrak{B}')$, we say $f : \Omega \to \Omega'$ is a random variable if $f^{\natural}(E) \in \mathfrak{B}$ for each $E \in \mathfrak{B}'$.
We say $f$ is an $\Omega'$-valued random variable.

Note: It is true that a $\Phi$-valued, measurable function, $g : \Omega \to \Phi$, is a $\Phi$-valued random
variable for any topological space $(\Phi, \tau)$, since there always exists a smallest $\sigma$-field
containing the elements of $\tau$ [44]. The specific language “random variable”, without the
hyphenated prefix, is reserved for the case when $\Phi = \mathbb{R}$.

Definition 16 (Stochastic Process [9]). Let $(\Omega, \mathfrak{B}, \mu)$ be a probability space and $\Theta$ a set
of parameters which may be finite, countably infinite, or uncountable. Then a family of
random variables indexed by $\Theta$, $X = \{X_\theta : \theta \in \Theta\}$, is a stochastic process. If $\Theta$ is
countably infinite or finite, then $X$ is a discrete parameter process. If $\Theta$ is a continuous
parameter, then $X$ is a continuous parameter process. If we fix $\omega \in \Omega$ and allow $\theta$
to vary, then the function $X_{\cdot}(\omega)$ is a sample function when $\Theta$ is uncountable. When $\Theta$ is
countable or finite, then $X_{\cdot}(\omega)$ is a sample sequence.

2.1.3  Category  Theory.  This  section  draws  upon  definitions  contained  in  [53]. 
Category  theory  is  a  branch  of  mathematics  useful  for  determining  universal  properties 
of  objects.  The  science  of  information  fusion  does  not  yet  know  of  all  the  relationships 
involved  between  the  classes  of  data  and  the  mappings  from  one  type  of  data  to  another. 
It  has  been  our  goal  to  try  to  engage  the  community  to  think  in  terms  of  generalities  when 
studying  fusion  processes  in  order  to  abstract  the  processes  and  perhaps  gain  some  clarity 
of  thought,  if  not  genuine  insight.  We  have  drawn  upon  the  work  of  various  authors  in 
Category  Theory  literature  [1,29,32,35]  to  present  the  definitions. 

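Definition 17 (Category). A category $\mathcal{C}$ is denoted as a 4-tuple,

$$\mathcal{C} = (\mathrm{Ob}(\mathcal{C}), \mathrm{Ar}(\mathcal{C}), \mathrm{Id}(\mathcal{C}), \circ),$$

and consists of the following:

A1. A class of objects denoted $\mathrm{Ob}(\mathcal{C})$, so object $O \in \mathrm{Ob}(\mathcal{C})$.

A2. A class of arrows denoted $\mathrm{Ar}(\mathcal{C})$, so arrow $f \in \mathrm{Ar}(\mathcal{C})$.

A3. Two mappings, called Domain (dom) and Codomain (cod), which assign to an arrow
$f \in \mathrm{Ar}(\mathcal{C})$ a domain and codomain from the objects of $\mathrm{Ob}(\mathcal{C})$. Thus, for
arrow $f \in \mathrm{Ar}(\mathcal{C})$, there exist objects $O_1 = \mathrm{dom}(f)$ and $O_2 = \mathrm{cod}(f)$ and we
represent the arrow $f$ by the diagram

$$O_1 \xrightarrow{\;f\;} O_2.$$

A4. A mapping assigning each object $O \in \mathrm{Ob}(\mathcal{C})$ a unique arrow $1_O \in \mathrm{Id}(\mathcal{C})$ called
the identity arrow, such that

$$O \xrightarrow{\;1_O\;} O$$

and such that for any existing element, $x$, of $O$, we have that

$$x \overset{1_O}{\longmapsto} x.$$

A5. A binary mapping, $\circ$, called composition, $\mathrm{Ar}(\mathcal{C}) \times \mathrm{Ar}(\mathcal{C}) \to \mathrm{Ar}(\mathcal{C})$. Thus,
given $f, g \in \mathrm{Ar}(\mathcal{C})$ with $\mathrm{cod}(f) = \mathrm{dom}(g)$ there exists a unique $h \in \mathrm{Ar}(\mathcal{C})$ such
that $h = g \circ f$.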

Axioms A3-A5 lead to the associative and identity rules:

• Associative Rule. Given appropriately defined arrows $f$, $g$, and $h \in \mathrm{Ar}(\mathcal{C})$ we
have that

$$(f \circ g) \circ h = f \circ (g \circ h).$$

• Identity Rule. Given arrows $A \xrightarrow{\;f\;} B$ and $B \xrightarrow{\;g\;} A$, then there exists arrow
$1_A \in \mathrm{Id}(\mathcal{C})$ such that $1_A \circ g = g$ and $f \circ 1_A = f$.
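
The axioms and the two rules can be exercised on a toy category whose objects are finite sets and whose arrows are total functions. This is a minimal sketch for illustration only; the class `Arrow` and the helpers `identity` and `compose` are hypothetical names, not constructions from the cited references.

```python
class Arrow:
    """An arrow between finite sets: a total function stored as a dict."""

    def __init__(self, dom, cod, mapping):
        self.dom, self.cod = frozenset(dom), frozenset(cod)
        self.mapping = dict(mapping)
        assert set(self.mapping) == self.dom, "must be defined on all of dom"
        assert set(self.mapping.values()) <= self.cod, "values must lie in cod"

    def __eq__(self, other):
        return (self.dom, self.cod, self.mapping) == (other.dom, other.cod, other.mapping)

def identity(obj):
    """The identity arrow 1_O on the object obj (axiom A4)."""
    return Arrow(obj, obj, {x: x for x in obj})

def compose(g, f):
    """g o f, defined only when cod(f) == dom(g) (axiom A5)."""
    if f.cod != g.dom:
        raise ValueError("arrows are not composable")
    return Arrow(f.dom, g.cod, {x: g.mapping[f.mapping[x]] for x in f.dom})

A, B, C, D = {1, 2}, {"a", "b"}, {"x", "y"}, {"z"}
f = Arrow(A, B, {1: "a", 2: "a"})
g = Arrow(B, C, {"a": "x", "b": "y"})
h = Arrow(C, D, {"x": "z", "y": "z"})
```

With these arrows, `compose(h, compose(g, f))` and `compose(compose(h, g), f)` are equal, and composing `f` with `identity(A)` or `identity(B)` returns `f`.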

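Definition 18 (Subcategory). A subcategory $\mathcal{B}$ of $\mathcal{A}$ is a category whose objects are some
of the objects of $\mathcal{A}$, i.e., $\mathrm{Ob}(\mathcal{B}) \subset \mathrm{Ob}(\mathcal{A})$, and whose arrows are some of the arrows of
$\mathcal{A}$, i.e., $\mathrm{Ar}(\mathcal{B}) \subset \mathrm{Ar}(\mathcal{A})$, such that for each arrow $f \in \mathrm{Ar}(\mathcal{B})$, $\mathrm{dom}(f)$ and $\mathrm{cod}(f)$ are in
$\mathrm{Ob}(\mathcal{B})$, along with each composition of arrows, and an identity arrow for each element of
$\mathrm{Ob}(\mathcal{B})$.

Definition 19 (Discrete Category). A discrete category is a category whose only arrows
are identity arrows, i.e., $\mathrm{Ar}(\mathcal{C}) = \mathrm{Id}(\mathcal{C})$.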

Definition  20  (Small  Category).  A  category  C  is  called  a  Small  Category  when  the  class 
Ob(C)  is  a  set. 

Note: A historical note on this is that while in this paper and in many works, the only
categories considered are small categories, category theorists are proposing an axiomatic
replacement for set theory as a mathematical foundation. In other words, all mathematical
properties can be shown using an axiomatic category theory rather than the
Zermelo-Fraenkel axioms for set theory. The belief is that the category theory approach will avoid
certain paradoxes which creep up in set theory, such as “the set whose members are not in
a set”.

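Definition 21 (Functor). A functor $\mathfrak{F}$ between two categories $\mathcal{A}$ and $\mathcal{B}$ is a pair of maps
$(\mathfrak{F}_{Ob}, \mathfrak{F}_{Ar})$,

$$\mathfrak{F}_{Ob} : \mathrm{Ob}(\mathcal{A}) \to \mathrm{Ob}(\mathcal{B}), \qquad \mathfrak{F}_{Ar} : \mathrm{Ar}(\mathcal{A}) \to \mathrm{Ar}(\mathcal{B}),$$

such that $\mathfrak{F}$ maps $\mathrm{Ob}(\mathcal{A})$ to $\mathrm{Ob}(\mathcal{B})$ and $\mathrm{Ar}(\mathcal{A})$ to $\mathrm{Ar}(\mathcal{B})$ while preserving the associative
property of the composition map and preserving identity maps.

Thus, given categories $\mathcal{A}$, $\mathcal{B}$ and functor $\mathfrak{F} : \mathcal{A} \to \mathcal{B}$, if $A \in \mathrm{Ob}(\mathcal{A})$ and $f, g, h, 1_A \in
\mathrm{Ar}(\mathcal{A})$ such that $f \circ g = h$ is defined, then there exists $B \in \mathrm{Ob}(\mathcal{B})$ and $f', g', h', 1_B \in
\mathrm{Ar}(\mathcal{B})$ such that

i) $\mathfrak{F}_{Ob}(A) = B$.

ii) $\mathfrak{F}_{Ar}(f) = f'$, $\mathfrak{F}_{Ar}(g) = g'$.

iii) $h' = \mathfrak{F}_{Ar}(h) = \mathfrak{F}_{Ar}(f \circ g) = \mathfrak{F}_{Ar}(f) \circ \mathfrak{F}_{Ar}(g) = f' \circ g'$.

iv) $\mathfrak{F}_{Ar}(1_A) = 1_{\mathfrak{F}_{Ob}(A)} = 1_B$.

We denote a functor $\mathfrak{F}$ between categories $\mathcal{A}$ and $\mathcal{B}$ with the diagram

$$\mathcal{A} \xrightarrow{\;\mathfrak{F}\;} \mathcal{B}.$$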
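
Conditions iii) and iv) can be spot-checked for a simple functor on sets and functions. The sketch below uses an assumed illustration that does not appear in the text: the functor sending a set to the tuples over it and a function `f` to `lift(f)`, which applies `f` entrywise. Equality of functions is only tested on one sample input, so this is a sanity check rather than a proof.

```python
def compose(g, f):
    """Ordinary function composition g o f."""
    return lambda x: g(f(x))

def identity(x):
    return x

def lift(f):
    """Arrow part of the functor: apply f to every entry of a tuple."""
    return lambda xs: tuple(f(x) for x in xs)

def double(n):
    return 2 * n

def succ(n):
    return n + 1

sample = (0, 1, 5)
left = lift(compose(succ, double))(sample)          # F(f o g)
right = compose(lift(succ), lift(double))(sample)   # F(f) o F(g)
```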

Example 11. An elementary example of a functor is the forgetful functor. Let GRP be
the category of groups which has as objects groups and as arrows morphisms between
groups. Let $(G)$ denote the underlying set of elements of a given group $G$. Then the
forgetful functor, $\mathfrak{F}$, maps groups to their underlying sets, and each group morphism to
the same map regarded as a function between the underlying sets.

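Definition 22 (Natural Transformation). Given categories $\mathcal{A}$ and $\mathcal{B}$ and functors $\mathfrak{F}$ and
$\mathfrak{G}$ with $\mathfrak{F} : \mathcal{A} \to \mathcal{B}$ and $\mathfrak{G} : \mathcal{A} \to \mathcal{B}$, then a Natural Transformation is a family of arrows
$\nu = \{\nu_A : A \in \mathrm{Ob}(\mathcal{A})\}$ such that for each $f \in \mathrm{Ar}(\mathcal{A})$, $f : A \to A'$, $A' \in \mathrm{Ob}(\mathcal{A})$, the
square diagram

$$\begin{array}{ccccc}
A & & \mathfrak{F}(A) & \xrightarrow{\;\nu_A\;} & \mathfrak{G}(A) \\
\downarrow f & & \downarrow \mathfrak{F}(f) & & \downarrow \mathfrak{G}(f) \\
A' & & \mathfrak{F}(A') & \xrightarrow{\;\nu_{A'}\;} & \mathfrak{G}(A')
\end{array}$$

commutes. We say the arrows $\nu_A$ are the components of

$$\nu : \mathfrak{F} \to \mathfrak{G},$$

and call $\nu$ the natural transformation of $\mathfrak{F}$ to $\mathfrak{G}$.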

Example 12. This example is from [32]. Let CRng be the category of commutative
rings, and $GL_n(\cdot)$ be the category of general linear groups, which consists of all $n \times n$
invertible matrices over commutative ring $(\cdot)$. The determinant of the matrices is a natural
transformation (since the determinant is calculated with the same formula regardless of the
ring used) making the following square commute ($K^{*}$, $K'^{*}$ denote the groups of units,
that is, the invertible elements, of the rings $K$, $K'$, and therefore they are objects
of the category GRP):

$$\begin{array}{ccccc}
K & & GL_n(K) & \xrightarrow{\;\det_K\;} & K^{*} \\
\downarrow f & & \downarrow GL_n(f) & & \downarrow f^{*} \\
K' & & GL_n(K') & \xrightarrow{\;\det_{K'}\;} & K'^{*}
\end{array}$$

This says that for every morphism, $f$, of commutative rings, the determinant is natural
among functors CRng $\to$ GRP.
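
The commuting square can be checked on a concrete instance. The sketch below assumes the ring morphism is reduction of the integers modulo a prime `p`, and computes determinants by cofactor expansion; the helper names `det`, `mod_p` and `map_entries` and the sample matrix are hypothetical. Taking the determinant over the integers and then reducing agrees with reducing the entries first and taking the determinant modulo `p`.

```python
def det(matrix, reduce=lambda v: v):
    """Determinant by cofactor expansion along the first row.
    `reduce` maps intermediate values into the ring (identity for the integers)."""
    n = len(matrix)
    if n == 1:
        return reduce(matrix[0][0])
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * matrix[0][j] * det(minor, reduce)
    return reduce(total)

def mod_p(p):
    """The ring morphism from the integers to the integers modulo p."""
    return lambda v: v % p

def map_entries(f, matrix):
    """Entrywise action of f on a matrix, playing the role of GL_n(f)."""
    return [[f(v) for v in row] for row in matrix]

M = [[7, 6, 0], [6, 5, 0], [3, 4, 1]]     # integer matrix with determinant -1
f = mod_p(5)

down_then_across = det(map_entries(f, M), f)   # GL_n(f) first, then det over Z_5
across_then_down = f(det(M))                   # det over Z first, then f
```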

Definition 23 (Functor Category $\mathcal{A}^{\mathcal{B}}$). Given categories $\mathcal{A}$ and $\mathcal{B}$, the notation $\mathcal{A}^{\mathcal{B}}$
represents the category of all functors $\mathfrak{F}$ such that $\mathfrak{F} : \mathcal{B} \to \mathcal{A}$. This category has all such
functors as objects and the natural transformations between them as arrows. We can also
have that given objects $A \in \mathrm{Ob}(\mathcal{A})$ and $B \in \mathrm{Ob}(\mathcal{B})$, there exist functor categories
denoted as $A^{\mathcal{B}}$, $\mathcal{A}^{B}$, and $A^{B}$ as well.

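Definition 24 (Product Category). Let $\{\mathcal{C}_i\}_{i=1}^{n}$ be a finite collection of small categories.
Then the cartesian product

$$\prod_{i=1}^{n} \mathcal{C}_i = \mathcal{C}_1 \times \mathcal{C}_2 \times \cdots \times \mathcal{C}_n$$

forms a category called the product category. For each $O \in \mathrm{Ob}\bigl(\prod_{i=1}^{n} \mathcal{C}_i\bigr)$, then
$O = (O_1, O_2, \ldots, O_n)$ where $O_i \in \mathrm{Ob}(\mathcal{C}_i)$ for $i = 1, 2, \ldots, n$. For each arrow
$f \in \mathrm{Ar}\bigl(\prod_{i=1}^{n} \mathcal{C}_i\bigr)$, then $f = (f_1, f_2, \ldots, f_n)$ where $f_i \in \mathrm{Ar}(\mathcal{C}_i)$ for $i = 1, 2, \ldots, n$. Given
arrows $f, g \in \mathrm{Ar}\bigl(\prod_{i=1}^{n} \mathcal{C}_i\bigr)$, then the composition of these arrows means

$$f \circ g = (f_1 \circ g_1, f_2 \circ g_2, \ldots, f_n \circ g_n).$$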
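
Componentwise composition in a product category can be sketched with tuples of ordinary functions. This is an illustration only, assuming each factor category is a category of sets and functions; the names `compose_product` and `apply_product` are hypothetical.

```python
def compose(f, g):
    """Ordinary function composition f o g (apply g first)."""
    return lambda x: f(g(x))

def compose_product(fs, gs):
    """(f_1 o g_1, ..., f_n o g_n) for two tuples of arrows of equal length."""
    if len(fs) != len(gs):
        raise ValueError("arrow tuples must have the same length")
    return tuple(compose(f, g) for f, g in zip(fs, gs))

def apply_product(arrows, point):
    """Apply a product arrow to a tuple of elements, one component at a time."""
    return tuple(a(x) for a, x in zip(arrows, point))

f = (lambda n: n + 1, str.upper)
g = (lambda n: 2 * n, lambda s: s + "!")
result = apply_product(compose_product(f, g), (3, "ab"))
```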


2.1.3.1 Category Examples. Some examples of categories are:

Example  13.  The  category  of  all  Abelian  Groups,  Ab.  Here  the  objects  are  abelian  groups 
and  the  arrows  are  all  morphisms  from  one  Abelian  Group  to  another. 

Example  14.  The  category  Ban  of  Banach  Spaces.  Here  the  objects  are  Banach  Spaces, 
and  the  arrows  are  all  bounded  linear  transformations  between  them. 

Example 15. The category $\mathbf{Vect}_K$ of finite-dimensional Vector Spaces over the field $K$.
The objects are finite-dimensional vector spaces and the arrows are all linear transformations
between them.

Example  16.  The  category  SET,  a  small  category  whose  objects  are  sets  and  arrows  are 
the  total  functions  between  them. 

Example  17.  The  category  CAT,  a  small  category  whose  objects  are  small  categories 
and  arrows  are  the  functors  between  them. 
Some examples of functors are:

Example 18. $\mathfrak{F}$ : Ab $\to$ SET, which is the forgetful functor which maps each abelian group
to its underlying set, now considered as a set only within the category SET rather than a
group, and each morphism of Ab to the same map regarded as a function between the
underlying sets.

Example 19. $\mathfrak{G}$ : Ban($X$) $\to$ SET, the functor mapping all subspaces of a Banach space
$X$ to their respective subsets. Arrows are mapped to the corresponding functions between
these subsets, so this functor is also a “forgetful” functor.


2.2  Receiver  Operating  Characteristic  (ROC)  Background 

2.2.1 Definition of ROC curve. Let $(\Omega, \mathfrak{B}, \mu)$ be a probability space, $L$ be a two-class
label set, $L = \{\ell_1, \ell_2\}$, and let $X(t, \cdot) : \Omega \to L$ be a discrete random variable
indexed by a parameter set $T$, where $t \in T$ is a parameter, and $T$ might be uncountable
and multidimensional. We will refer to the sample function $X_t(\cdot) = X(t, \cdot)$ as a classifier
of members of $\mathfrak{B}$. Usually, $T$ is homeomorphic to some subset of $\mathbb{R}^m$ for some $m \in \mathbb{N}$.
We assume $\Omega$ can be partitioned into two sets of events, so $\Omega = \Omega_1 \cup \Omega_2$, where the first
set $\Omega_1$ corresponds to the label $\ell_1$, and the second to the label $\ell_2$. Thus, $\Omega_1 \cap \Omega_2 = \emptyset$
is assumed. Under the assumption of only two labels, we will assume that $T$ is a
one-dimensional parameter space.

Each classifier $X_t$ can make a mistake in classification. There are two types of errors
it can make. It can assign objects in class 1 to label $\ell_2$, or it can assign objects in class 2
to label $\ell_1$. Let $X_t^{\natural}(\ell_i)$ denote the preimage of the label $\ell_i$ under classifier $X(t, \cdot)$. We
can construct the two conditional probabilities of a classifier making these errors as

$$p_{1|2}(t) = P(\ell_1 \mid \Omega_2) = \frac{\mu\bigl(X_t^{\natural}(\ell_1) \cap \Omega_2\bigr)}{\mu(\Omega_2)} \qquad (2.4)$$

and

$$p_{2|1}(t) = P(\ell_2 \mid \Omega_1) = \frac{\mu\bigl(X_t^{\natural}(\ell_2) \cap \Omega_1\bigr)}{\mu(\Omega_1)} \qquad (2.5)$$




where $\mu(\Omega_2)$ and $\mu(\Omega_1)$ are the prior probabilities of their respective events and the $p_{i|j}(t)$
for $i, j = 1, 2$ are the conditional class probabilities of classifying an event as $\ell_i$ when
an event with true label $\ell_j$ has occurred. Two conjunctive, conditional class probabilities,
constructed in the same manner, form the following relationships [11]:

$$p_{1|2}(t) + p_{2|2}(t) = 1 \qquad (2.6)$$

$$p_{2|1}(t) + p_{1|1}(t) = 1 \qquad (2.7)$$

For a specific $t \in T$, the ordered pair, $(p_{1|2}(t), p_{1|1}(t))$ is called the receiver operating
characteristic (ROC) of classifier $X(t, \cdot)$, when the dependent class is $\Omega_1$. We will use
the notation $(p_{1|2}(t), p_{2|1}(t))$ as the ROC, however, to better accommodate our description
of the n-class problem. A set $\mathbb{X} = \{X(t, \cdot) : t \in T\}$ is called a family of classification
systems (alternatively, a classifier family). We say the set of triples formed by $\mathbb{X}$,

$$h = \{(t, p_{1|2}(t), p_{2|1}(t)) : t \in T, \Omega = \Omega_1 \cup \Omega_2\}$$

forms a ROC trajectory, when it is lower semi-continuous and monotonic, non-increasing
(see [9] regarding distribution function properties for comparison). We say the set

$$f_{\mathbb{X}} = \{(p_{1|2}(t), p_{2|1}(t)) : t \in T, \Omega = \Omega_1 \cup \Omega_2\}, \qquad (2.8)$$

which is the projection of the ROC trajectory from the space $T \times \mathbb{R}[0, 1]^2$ into the space
$\mathbb{R}[0, 1]^2$, is the ROC curve of family of classification systems $\mathbb{X}$, when its closure has
endpoints in the compact interval $\mathbb{R}[0, 1]$, and it is lower semi-continuous, and monotonic
non-increasing. We also call the ROC curve the ROC manifold (technically, a ROC
1-manifold, see Lemma 2.2.2), since this curve is homeomorphic to $\mathbb{R}^1$ for every open ball on
the curve and it is a Hausdorff space.
Figure 2.1: A Typical ROC Curve from Two Normal Distributions
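
An empirical counterpart of the ROC trajectory and ROC curve can be sketched for a threshold family of classifiers. The code below is an illustration under stated assumptions: each event carries a hypothetical real-valued score, the classifier with parameter `t` assigns label 1 when the score exceeds `t`, and the conditional probabilities of Equations 2.4 and 2.5 are replaced by sample frequencies.

```python
def roc_trajectory(scores_1, scores_2, thresholds):
    """Return triples (t, p_{1|2}(t), p_{2|1}(t)); label 1 is assigned when score > t."""
    trajectory = []
    for t in thresholds:
        # Fraction of class-2 events that are (wrongly) given label 1.
        p_1_given_2 = sum(s > t for s in scores_2) / len(scores_2)
        # Fraction of class-1 events that are (wrongly) given label 2.
        p_2_given_1 = sum(s <= t for s in scores_1) / len(scores_1)
        trajectory.append((t, p_1_given_2, p_2_given_1))
    return trajectory

scores_1 = [0.9, 0.8, 0.7, 0.55, 0.4]   # hypothetical scores of class-1 events
scores_2 = [0.6, 0.5, 0.3, 0.2, 0.1]    # hypothetical scores of class-2 events
thresholds = [0.0, 0.25, 0.45, 0.65, 1.0]

trajectory = roc_trajectory(scores_1, scores_2, thresholds)
curve = [(x, y) for _, x, y in trajectory]   # projection that drops the parameter t
```

For these sample scores the projected points run from (1, 0) to (0, 1), and the second coordinate never decreases as the first one decreases, which matches the monotonic non-increasing shape required of a ROC curve in these coordinates.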

Definition 25 (Proper ROC Curve [4]). Given a metric space $(T, d)$ and a (one-dimensional)
parameter $t \in T$, a continuous ROC curve $f_{\mathbb{X}}$ as defined in Equation 2.8 is called a proper
ROC curve when

1. $\lim\,(p_{2|1}(t), p_{1|2}(t)) = (0, 1)$.

2. $\lim\,(p_{2|1}(t), p_{1|2}(t)) = (1, 0)$.

Typically, ROC curves are graphed using $(p_{1|2}(t), p_{1|1}(t))$ as coordinate pairs, producing
a curve from $(0, 0)$ to $(1, 1)$. For multi-class problems (greater than two classes), this is
not the best visualization scheme to follow.

2.2.2 ROC Space. Many publications refer to the real set product $\mathbb{R}([0, 1]) \times
\mathbb{R}([0, 1])$ as ROC space. This terminology is unfortunate since $\mathbb{R}([0, 1])$ is not a ‘space’
in the sense of a linear space. We clarify here that by the term ROC space we mean
the topological subspace $(\mathbb{R}^2([0, 1]), \tau)$ of $(\mathbb{R}^2, \tau)$ where $\tau$ is the Euclidean topology (the
topology induced by the usual distance metric).

Lemma  1  (ROC  1-Manifold).  A  proper  ROC  curve  is  a  1-manifold  in  ROC  space. 
Proof: Let $S = \{(P_{2|1}(A_\theta), P_{1|2}(A_\theta)) : \theta \in \Theta, \Omega = \Omega_1 \cup \Omega_2, A_\theta \in \mathbb{A}\}$ be a proper ROC
curve, with $\{\Omega_1, \Omega_2\}$ a partition of $\Omega$ into two classes. Let $x(\theta) = P_{2|1}(A_\theta)$, $y(\theta) =
P_{1|2}(A_\theta)$, and let $x = x(\theta)$, $y = y(\theta)$ for brevity of notation. Let $\varepsilon > 0$ be given. The
norm $\|\cdot\|$ is the Euclidean 2-norm. An open set $A$ in $S$ is open relative to the usual $\mathbb{R}^2$
topology. There is a countable basis for this topology which consists of the open balls of
rational radius $r$ about each coordinate point with rational first component. To show $S$ is
Hausdorff let $(x, y)$ and $(w, z)$ be two distinct points in $S$. Then we have that

$$\|(x, y) - (w, z)\| = \delta,$$

for some $\delta \in \mathbb{R}$, $\delta > 0$. Let $\gamma = \delta/2$. Thus we have that

$$B((x, y); \gamma) \cap B((w, z); \gamma) = \emptyset,$$

so these are two non-intersecting open sets containing the two distinct points.

Now let $(x, y) \in S$ be given. Define a function $g : B((x, y); \varepsilon) \to B(x; \varepsilon) \subset \mathbb{R}^1$ by

$$g[(x, y)] = x, \quad \forall\, (x, y) \in B((x, y); \varepsilon),$$

where $B(\cdot\,; \varepsilon)$ is an open ball of radius $\varepsilon$ with center $\cdot$. Clearly, $g$ is one-to-one, since for
$z \in \mathbb{R}$ such that $x = z$ we have that there exists $y_2 \in \mathbb{R}$ with $g[(z, y_2)] = z$. Thus, if
$(x, y) \neq (z, y_2)$, then either $x \neq z$ (which is a contradiction) or $y \neq y_2$. Suppose $y \neq y_2$.
Then $S$ is not a set representation of a function, which is a contradiction, since this is
implicit in the definition of $S$. Therefore, $(x, y) = (z, y_2)$, and $g$ is one-to-one. Thus, $g$
has an inverse, $g^{-1}$.

Now, let $\epsilon > 0$ be given, with $\varepsilon > \epsilon > 0$. Then for $(x_2, y_2) \in B((x, y); \varepsilon)$ such that

\[ \|(x, y) - (x_2, y_2)\| < \epsilon \]

we have that

\[
\begin{aligned}
\|g[(x, y)] - g[(x_2, y_2)]\| &= \|x - x_2\| \\
&\le \sqrt{(x - x_2)^2 + (y - y_2)^2} \\
&= \|(x - x_2,\, y - y_2)\| \\
&= \|(x, y) - (x_2, y_2)\| \\
&< \epsilon
\end{aligned}
\tag{2.9}
\]


so that $g$ is continuous as well. Since $g$ is continuous over every compact subset of $B((x, y); \varepsilon)$, $g^{-1}$ is continuous on $g(B((x, y); \varepsilon))$. Now, there exists an open set, $O \subset [0, 1] \subset \mathbb{R}$, such that $O \subset g[B((x, y); \varepsilon)]$ and $g^{-1}(O) = B((x, y); \varepsilon)$. Hence, for all $o \in O$, $g^{-1}(o) \in B((x, y); \varepsilon)$. Since $g[g^{-1}(o)] \in g(B((x, y); \varepsilon))$ for all $o \in O$, we have that $B((x, y); \varepsilon) \subset S$ is homeomorphic to $O \subset \mathbb{R}^1$, with

\[ g : B((x, y); \varepsilon) \to O \]

being the homeomorphism, so that $S$ is a 1-manifold in ROC-space. $\diamond$

An example is seen in Figure 2.1. This proof can be extended to show that a ROC surface in $n$-space is a ROC $(n - 1)$-manifold, the basis of the manifold being the points on the ROC surface corresponding to $(r_1, r_2, \dots, r_{n-1}, x_n)$, where the first $n - 1$ components are rational numbers, with $x_n$ being the dependent component, along with rational radii in an $(n - 1)$-ball open relative to the $\mathbb{R}^n$ topology.

2.2.3 ROC n-Space. We will retain the conventional language of ROC space and offer an extension to $n$ dimensions. Suppose we have a multi-class label set (a label set with more than two labels). To construct a corresponding ROC space, in the case of $m > 2$ labels, we desire to have $n = m^2 - m$ axes, so we will designate this ROC space as a ROC n-space. This is due to the fact that when there are $m$ classes, the number of possible types of classifications of the classification system is $m^2$ and the number of conjunctive conditional probability equations is $m$ (which also corresponds to the number of correct classifications), so that there are $m^2 - m$ degrees of freedom left after the application of the conjunctive equations (instead of the $m^2 - 1$ degrees of freedom usually allowed by contingency tables), which we have already seen in the case of $m = 2$ with the application of the conjunctive equations in Equations 2.6 and 2.7. So if we associate a correct classification with each of the $m$ conjunctive equations, then we have $m^2 - m$ incorrect classifications corresponding to the degrees of freedom, each demanding its own axis in ROC n-space. If we were to allow all errors to have equal cost, then we could combine all errors within a class, and we would then have $n = m^2 - m(m - 1) = m$ degrees of freedom, which is the same as the number of classes, each one requiring its own axis. When $m = 2$, we have that $n = 2$, which results in the typical ROC space of ROC curves.
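
The axis count described above can be checked with a short calculation. The following Python sketch is illustrative only; the function name `roc_axes` and its `equal_costs` flag are assumptions introduced here, not notation from the text.

```python
def roc_axes(m: int, equal_costs: bool = False) -> int:
    """Number of axes of ROC n-space for a label set with m classes.

    Hypothetical helper: follows the counting argument in the text,
    m**2 outcome types minus the m correct classifications.
    """
    if m < 2:
        raise ValueError("a classification problem needs at least two classes")
    if equal_costs:
        # All errors within a class are combined: one axis per class.
        return m * m - m * (m - 1)
    # One axis per incorrect classification.
    return m * m - m
```

For example, `roc_axes(3)` gives six axes, while `roc_axes(3, equal_costs=True)` gives three, one per class.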

In  the  case  of  three  classes,  m  =  3,  as  an  example,  examine  the  conjunctive  condi¬ 
tional  probability  equations  (with  notation  suppressed  with  respect  to  the  sample  functions 
involved), 


\[
\begin{aligned}
P_{1|1} + P_{2|1} + P_{3|1} &= 1 \\
P_{1|2} + P_{2|2} + P_{3|2} &= 1 \\
P_{1|3} + P_{2|3} + P_{3|3} &= 1
\end{aligned}
\]

where $P_{i|j}$ is defined for $i, j = 1, 2, 3$. This system could be described by a $3 \times 3$ stochastic matrix. Notice that once the errors of each row are given, the correct classification is completely determined by the equation. Additionally, the equations could be rewritten as

\[
\begin{aligned}
P_{2|1} + P_{3|1} &= 1 - P_{1|1} \\
P_{1|2} + P_{3|2} &= 1 - P_{2|2} \\
P_{1|3} + P_{2|3} &= 1 - P_{3|3}
\end{aligned}
\]

so that the ROC space needed to describe the system completely is now 3-space due to all costs being equal (in this case, cost $c_{ij} = 1$, $\forall\, i, j$). There is a relationship between the
dimensionality  of  a  parameter  set  T  and  the  dimensionality  of  the  ROC  manifold.  Ulti¬ 
mately,  we  want  to  construct  ROC  manifolds  which  allow  a  unique  optimization  point  to 
be  embedded  in  the  manifold,  while  maintaining  independence  of  the  conditional  proba¬ 
bilities  of  n  —  1  classes.  If 

r  =  dim(T)  >  n  —  1, 

then  there  are  several  optimal  points  embedded  in  the  ROC  r-manifold,  so  that  a  unique 
solution  cannot  be  found  analytically.  If  r  <  n  —  1,  then  a  unique  optimization  point 
embedded  in  the  ROC  r-manifold  can  be  found,  but  independent  control  over  all  of  the 
conditional  probabilities  is  lost  and  information  corresponding  to  each  class  is  incomplete. 
Therefore,  when  we  refer  to  ROC  n-space,  the  ROC  manifolds  assumed  to  inhabit  it  are 
ROC  (n  —  l)-manifolds  unless  otherwise  declared.  This  means  the  parameter  space  T  is 
assumed  to  be  of  dimension  n  —  1,  and  this  guarantees  a  unique  optimization  point  with 
respect  to  the  assumptions  on  prior  probabilities  and  costs. 
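
As a small illustration of the three-class bookkeeping above, the sketch below reads a ROC point off a stochastic matrix. It assumes a layout in which row $j$ holds the probabilities $P_{i|j}$ for $i = 1, \dots, m$; the function name `roc_point` and the example matrix are hypothetical.

```python
def roc_point(stochastic):
    """Extract ROC coordinates from an m x m stochastic matrix.

    Assumed layout: stochastic[j][i] = P(label i | true class j),
    so every row sums to 1 (the conjunctive equations).
    Returns (errors, per_class_error):
      errors          - the m**2 - m off-diagonal entries (full ROC n-space)
      per_class_error - 1 - P(j|j) for each class (equal-cost case, m axes)
    """
    m = len(stochastic)
    for row in stochastic:
        if len(row) != m or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row must have m entries summing to 1")
    errors = [stochastic[j][i] for j in range(m) for i in range(m) if i != j]
    per_class_error = [1.0 - stochastic[j][j] for j in range(m)]
    return errors, per_class_error


# Hypothetical three-class example.
example = [
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.05, 0.15, 0.80],
]
errors, per_class_error = roc_point(example)
```

Here `errors` has six entries (a point in ROC 6-space), and `per_class_error` has three (the equal-cost point in 3-space).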

2.2.4 Convergence of Receiver Operating Characteristic (ROC) curves. Albert Einstein once said, "Not everything that can be counted counts, and not everything that counts can be counted." [47] Part of the reason we use ROC curves is due to their inherent dependency upon probability theory. Some sets have measure (they count, but may not be countable), and some have none (they don't count, though they may be countable). The ROC curve is a graph of tradeoffs of the errors made by families of classification systems.
Virtually  all  ROC  manifolds  are  estimates  of  performance  and  do  not  meet  the  theoretical 
constraints  we  have  defined.  However,  Alsing,  in  his  Ph.D.  dissertation  [4],  put  forth 
a  theorem  which  shows  that  estimates  of  ROC  curves,  created  from  calculating  the  true 
positive  and  false  positive  rates,  converge  to  the  ROC  curve  of  a  family  of  classification 
systems.  We  rely  upon  this  convergence  when  we  discuss  the  theory,  because  without  it, 
not  much  makes  sense.  Therefore,  we  refer  to  the  ROC  curve.  Alsing’s  proof  of  ROC 
convergence  focused  on  two  things: 

1. The estimates $P_n(t)$ are random variables which depend upon the actual (he says finite) data collected during the test.

2. There is a collection of metrics which show, for the estimate $P_n(t)$ and any metric $d$ in his collection, that

\[ \lim_{n \to \infty} d\big(P_n(t), P(t)\big) = 0 \]

for some $P$. This $P$ is referred to as the [emphasis mine] ROC curve.

With  this  proof  we  can  theorize  more  about  the  actual  underlying  ROC  curves  and  compare 
the  systems  they  represent  without  much  worry  over  our  goals.  After  all,  if  we  have  a 
non-continuous  collection  of  ROC  points  from  a  family  of  classification  systems,  we  can 
approximate  the  underlying  continuous  ROC  curve  by  connecting  the  points  with  straight 
lines.  We  can  then  imagine  a  sequence  of  such  ROC  curves  converging  to  the  ROC  curve 
in  order  to  talk  about  where  the  optimal  points  on  the  curve  are,  and  perhaps  to  compare 
one  curve  generated  by  a  family  of  classification  systems  to  a  finite  number  of  other  curves 
generated  by  different  families  of  classification  systems. 
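
The straight-line approximation mentioned above can be sketched as follows. This is a minimal illustration, assuming the ROC points are given as `(x, y)` pairs of error rates; the function name `interpolate_roc` is hypothetical.

```python
def interpolate_roc(points, x):
    """Piecewise-linear estimate of a ROC curve at abscissa x.

    points: finite collection of (x, y) ROC points from a family of
            classification systems (at least two points).
    Connects consecutive points with straight lines, as described in the text.
    """
    pts = sorted(points)
    if len(pts) < 2:
        raise ValueError("need at least two ROC points")
    if not pts[0][0] <= x <= pts[-1][0]:
        raise ValueError("x lies outside the range of the ROC points")
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            if x1 == x0:
                return y0  # repeated abscissa: use the first point
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

Adding more ROC points refines the polygonal curve, which is the sense in which a sequence of such approximations can be imagined to approach the underlying curve.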

There  are  a  few  problems  with  Alsing’s  approach  and  proof.  First,  it  relies  upon  the 
assumption  of  convergence  of  the  sets  of  finite  feature  vectors  to  the  sample  space.  There 
are  two  errors  with  this  statement.  First,  I  believe  he  means  to  say  that  in  some  way  he 
can take countably increasing random samples of feature vectors, which are converging to the test sample space. This convergence is described as being in the Hausdorff metric. This is the second error, since although the Hausdorff metric is calculated using the closure of each set relative to the other, there is no way one can accomplish this with even a countable collection of random variables. Additionally, he relies upon 'balanced' samples of class 1 and class 2 objects of detection. This is unrealistic and unnecessary to the proof when using the Law of Large Numbers [5]. Therefore, convergence in the Hausdorff metric is not a sufficient constraint for the theorem, the sets of random samples need to be constructed appropriately, and there is no need for these sets to strive for balance if they are truly random samples. Furthermore, there needs to be a proof showing convergence of ROC curves when the test sample spaces are a collection of countably increasing nested sample spaces with the population sample space as the union of the collection. Together these two proofs would demonstrate that as you increase the number of random samples from a test sample space, the conditional probabilities (and the ROC manifolds) converge to the expected values almost surely, and that as you nest your sample spaces in a countably infinite fashion, your conditional probabilities (and the ROC manifold) converge almost surely. This is important if you are going to use ROC manifolds from a test as a measure
of  performance.  Thus,  if  we  set  up  the  test  to  reflect  as  accurately  as  possible  the  real 
world,  and  we  take  enough  random  samples,  we  can  have  confidence  in  using  the  ROC 
curve  as  a  performance  characteristic  of  families  of  classification  systems  participating  in 
the  same  procedure. 
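
The role of random sampling can be illustrated numerically. The sketch below assumes a toy problem that is not taken from the text: a scalar score drawn from a Gaussian distribution and a threshold classifier, so that the exact error rate is known in closed form and can be compared with its sample estimate. The function names are hypothetical.

```python
import math
import random


def true_error_rate(threshold, mean=0.0, sd=1.0):
    """Exact P(score > threshold) for a Gaussian score (toy assumption)."""
    return 0.5 * math.erfc((threshold - mean) / (sd * math.sqrt(2.0)))


def estimated_error_rate(threshold, n, rng, mean=0.0, sd=1.0):
    """Fraction of n random samples whose score exceeds the threshold."""
    hits = sum(1 for _ in range(n) if rng.gauss(mean, sd) > threshold)
    return hits / n


rng = random.Random(0)  # fixed seed so the run is repeatable
exact = true_error_rate(1.0)
estimate = estimated_error_rate(1.0, 100_000, rng)
```

With enough samples, `estimate` settles near `exact`, which is the behavior the Law of Large Numbers guarantees for each fixed threshold.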

Alsing begins his proof by showing that $\hat{p}^{\,n}_{i|j}(\cdot)$ (my notation, not his) is a consistent estimator of the mean. This is true due to the weak law of large numbers (in his proof he applies Chebyshev's inequality), so that for each $t \in T$ we have that

\[ \operatorname*{plim}_{n \to \infty} \hat{p}^{\,n}_{i|j}(t) = \pi_{ij} \tag{2.10} \]

for some mean value $\pi_{ij}$, where $\operatorname*{plim}_{n \to \infty}$ denotes the limit in probability. Alsing does not characterize the values $\pi_{ij}$ beyond this or $P(\cdot) = (p_{1|2}(\cdot), p_{1|1}(\cdot))$ (the ROC curve), and fails to connect the expected values of his random variables to the actual conditional probabilities he's trying to prove his estimates converge to. Moreover, his collection of metrics seems to require the measure space $(T, \mathcal{T}, \mu)$ from which we draw the parameter $t$ be such that $\mu(T) < \infty$ (that is, a finite measure space). This is certainly not correct for $T = \mathbb{R}$ and $\mu$ being Lebesgue measure. Even the simplest of toy problems has $T = \mathbb{R}$, so then no positive, translation-invariant measure can be used to scale $T$ down to a finite measured set.

Therefore, we offer a proof of convergence that better characterizes the nature of ROC convergence, and we extend the result to problems with more than two classes.

Theorem 1 (Extension of Alsing's ROC Convergence, Convergence of ROC manifolds). Let $k \in \mathbb{N}$ be given, $m = k^2 - k$. Given denumerable, nested partitions of random samples, whose union is a sample population, the sequence of ROC $(m - 1)$-manifolds, constructed from sample functions with parameter set $\Theta$, a $\sigma$-finite measure space, converges to a ROC manifold.

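Proof: Let $(\Omega, \mathcal{B}, \mu)$ be a probability space and $(\Theta, \mathcal{T}, \zeta)$ be a $\sigma$-finite measure space of parameters. Let $\{\Omega_j\}_{j=1}^{k}$ be a partition of $\Omega$ into $k$ classes. Let $O_n \in \mathcal{B}$ for each $n \in \mathbb{N}$. Let $O_n \uparrow \Omega$ as $n \to \infty$, i.e., $O_1 \subset O_2 \subset \dots \subset O_n \subset \dots \subset \Omega$ and $\bigcup_{n=1}^{\infty} O_n = \Omega$. For each $n$ let $\{O_{n,j}\}_{j=1}^{k}$ be a partition of $O_n$ into the $k$ classes. We assume $O_{n,j} \neq \emptyset$ for each $n \in \mathbb{N}$ and each $j$, and that $O_{n,j} \uparrow \Omega_j$ as $n \to \infty$ for each $1 \le j \le k$. Let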

0  <i<n 
1  <j<k 

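Let $\mathbb{A}$ be a family of classification systems of $\Omega$, so that for each parameter $\theta \in \Theta$, $A_\theta$ maps $\Omega$ into the set of class labels and defines a discrete, $\mathcal{B}$-measurable random variable.

Denote by $A_\theta^{-1}(k)$ the preimage of class $k$ under $A_\theta$. Let $O_n$ be the sample space of the $n$th instantiation of data. Now fix $\theta \in \Theta$, where $\theta = (\theta_1, \theta_2, \dots, \theta_{k^2-k})$, and let $\Delta = \{\delta_1, \delta_2, \dots\}$ be a discrete index set. Then for each $O_{n,j}$ we can construct a new probability space, $(O_{n,j}, \mathcal{B}_{n,j}, \mu_{n,j})$, where $\mathcal{B}_{n,j}$ is a $\sigma$-field on $O_{n,j}$, with $\mathcal{B}_{n,j} \subset \mathcal{B}$, and $\mu_{n,j}(B) = \mu(B)/\mu(O_{n,j})$ for each $B \in \mathcal{B}_{n,j}$. Let $C_{i,j,n} = A_\theta^{-1}(i) \cap O_{n,j}$.

Now, for each $\delta_r \in \Delta$, construct the random variable

\[ X_r^{\theta}(\cdot) = I_{C_{i,j,n}}(\cdot), \]

where $I$ is an indicator function. This random variable essentially tells us whether or not an $\omega \in O_{n,j}$ is classified as an error. Then the expected value of the random variable is

\[ E\{I_{C_{i,j,n}}\} = \frac{\mu(C_{i,j,n})}{\mu(O_{n,j})} = P\big(A_\theta^{-1}(i) \mid O_{n,j}\big) = P^{(n)}_{i|j}(\theta). \]

Now let

\[ P^{(m)}_{i|j}(\theta) = \frac{1}{m} \sum_{r=1}^{m} X_r^{\theta}. \]

Since the $X_r^{\theta}$ are independent identically distributed random variables, by the strong Law of Large Numbers [5], we have that

\[ P^{(m)}_{i|j}(\theta) \to P\big(A_\theta^{-1}(i) \mid O_{n,j}\big) = P^{(n)}_{i|j}(\theta) \quad \text{almost surely } \mu \text{ as } m \to \infty. \]

Consider wlog that error $P^{(m)}_{k-1|k}$ is the dependent variable with respect to the classification system. Then fixing the parameter $\theta \in \Theta$ and letting

\[ P^{(m)}(\theta) = \left( P^{(m)}_{2|1}(\theta), P^{(m)}_{3|1}(\theta), \dots, P^{(m)}_{k-1|k}(\theta) \right), \tag{2.11} \]

we have that the set of $(k^2 - k)$-vectors

\[ \left\{ P^{(m)}(\theta) : \theta \in \Theta \right\} \tag{2.12} \]

forms an estimate of the ROC manifold. We assume it is a proper ROC manifold, and that $(\Theta, \zeta)$ is a $\sigma$-finite measure space. Let

\[ P^{(n)}(\theta) = \left( P^{(n)}_{2|1}(\theta), P^{(n)}_{3|1}(\theta), \dots, P^{(n)}_{k-1|k}(\theta) \right) \tag{2.13} \]

and now consider the product measure $\mu \times \zeta$. Thus, by Fubini's Theorem [44] we have that

\[
\begin{aligned}
\lim_{m \to \infty} \int_{\Omega \times \Theta} \left| P^{(m)}(\theta) - P^{(n)}(\theta) \right| d(\mu \times \zeta)
&= \lim_{m \to \infty} \int_{\Theta} \left[ \int_{\Omega} \left| P^{(m)}(\theta) - P^{(n)}(\theta) \right| d\mu \right] d\zeta \\
&= \lim_{m \to \infty} \int_{\Omega} \left[ \int_{\Theta} \left| P^{(m)}(\theta) - P^{(n)}(\theta) \right| d\zeta \right] d\mu \\
&= 0,
\end{aligned}
\]

so that $P^{(m)}(\theta) \to P^{(n)}(\theta)$ almost everywhere $\mu \times \zeta$.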

Next, we offer a continuation of the idea of convergence by now considering a ROC manifold convergence. This convergence is similar to the convergence of distribution functions with the exceptions that, 1) because ROCs are inherently connected to probability measures, any convergence can only be as strong as convergence almost everywhere, and 2) the converging sequence of ROCs is constructed by building up smaller probability spaces into a universal one (universal with respect to the population).
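
The idea of building up nested sample spaces can be pictured on a finite toy population. The sketch below is an illustration under stated assumptions, not the construction used in the proofs: the population is a finite list, the nested sets $O_n$ are its first $n$ elements, and the function name `conditional_estimates` is hypothetical.

```python
def conditional_estimates(population, system, i, j):
    """Estimates of P(label i | class j) over nested samples O_1, O_2, ...

    population: list of (event, true_class) pairs; O_n is taken to be the
                first n pairs, so O_1 is contained in O_2, and so on.
    system:     a classification system mapping an event to a label.
    Returns one estimate per n for which O_n contains a class-j member.
    """
    estimates = []
    in_class = 0    # size of O_{n,j}
    labelled_i = 0  # members of O_{n,j} that the system labels i
    for event, true_class in population:
        if true_class == j:
            in_class += 1
            if system(event) == i:
                labelled_i += 1
        if in_class:
            estimates.append(labelled_i / in_class)
    return estimates


# Hypothetical population and threshold classifier.
population = [(1.0, 1), (0.2, 2), (3.0, 1), (0.4, 1), (2.0, 2)]
system = lambda e: 1 if e > 0.5 else 2
estimates = conditional_estimates(population, system, i=2, j=1)
```

Once the nested samples exhaust the population, the last estimate equals the population value of the conditional probability (here one third).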

Theorem 2 (ROC Convergence). Let $k \in \mathbb{N}$ be given, $m = k^2 - k$. Given denumerable, nested partitions of random samples within denumerable, nested partitions of sample populations, whose union is the population, the sequence of ROC $(m - 1)$-manifolds, constructed from sample functions with parameter set $\Theta$, converges to the ROC manifold.

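Proof: Let the assumptions and the estimates be the same as in Theorem 1. Let $P^{(n)}_{i|j}(\theta) = P(A_\theta^{-1}(i) \mid O_{n,j})$ be the estimate of the conditional probability $P_{i|j}(\theta) = P(A_\theta^{-1}(i) \mid \Omega_j)$. Now consider the following two notes:

1. Since $O_{n,j} \uparrow \Omega_j$, $\exists\, N_1(\theta) \in \mathbb{N}$ such that for $n \ge N_1$ we have that

\[ \left| \mu(\Omega_j) - \mu(O_{n,j}) \right| < \frac{\varepsilon a}{2\left( \mu(A_\theta^{-1}(i) \cap \Omega_j) + 1 \right)} \]

for each $i$, $1 \le i \le k$, and each $j$, $1 \le j \le k$.

2. Consider that $(A_\theta^{-1}(i) \cup O_{n,j}) \subset (A_\theta^{-1}(i) \cup \Omega_j)$. Since

\[ \mu(A_\theta^{-1}(i) \cup O_{n,j}) = \mu(A_\theta^{-1}(i)) + \mu(O_{n,j}) - \mu(A_\theta^{-1}(i) \cap O_{n,j}) \]

and

\[ \mu(A_\theta^{-1}(i) \cup \Omega_j) = \mu(A_\theta^{-1}(i)) + \mu(\Omega_j) - \mu(A_\theta^{-1}(i) \cap \Omega_j), \]

then by the monotonicity of $\mu$, we have that

\[ \mu(A_\theta^{-1}(i)) + \mu(O_{n,j}) - \mu(A_\theta^{-1}(i) \cap O_{n,j}) \le \mu(A_\theta^{-1}(i)) + \mu(\Omega_j) - \mu(A_\theta^{-1}(i) \cap \Omega_j), \]

so that

\[ \mu(A_\theta^{-1}(i) \cap \Omega_j) - \mu(A_\theta^{-1}(i) \cap O_{n,j}) \le \mu(\Omega_j) - \mu(O_{n,j}). \]

Thus, since the left side of the inequality is non-negative, we have that

\[ \left| \mu(A_\theta^{-1}(i) \cap \Omega_j) - \mu(A_\theta^{-1}(i) \cap O_{n,j}) \right| \le \left| \mu(\Omega_j) - \mu(O_{n,j}) \right|. \]

Now $\exists\, N_2 \in \mathbb{N}$ such that for $n \ge N_2$ we have that

\[ \left| \mu(\Omega_j) - \mu(O_{n,j}) \right| < \frac{\varepsilon a}{2 \mu(\Omega_j)} \]

for all $j$. Thus,

\[ \left| \mu(A_\theta^{-1}(i) \cap \Omega_j) - \mu(A_\theta^{-1}(i) \cap O_{n,j}) \right| < \frac{\varepsilon a}{2 \mu(\Omega_j)} \]

for all $j$.

Now let $N = \max\{N_1, N_2\}$. Then for each $\theta$, and $n \ge N$, we have that

\[
\begin{aligned}
\left| P_{i|j}(\theta) - P^{(n)}_{i|j}(\theta) \right|
&= \left| \frac{\mu(A_\theta^{-1}(i) \cap \Omega_j)}{\mu(\Omega_j)} - \frac{\mu(A_\theta^{-1}(i) \cap O_{n,j})}{\mu(O_{n,j})} \right| \\
&= \frac{\left| \mu(O_{n,j})\,\mu(A_\theta^{-1}(i) \cap \Omega_j) - \mu(\Omega_j)\,\mu(A_\theta^{-1}(i) \cap O_{n,j}) \right|}{\mu(\Omega_j)\,\mu(O_{n,j})} \\
&\le \frac{\left| \mu(O_{n,j})\,\mu(A_\theta^{-1}(i) \cap \Omega_j) - \mu(\Omega_j)\,\mu(A_\theta^{-1}(i) \cap \Omega_j) \right|}{\mu(\Omega_j)\,\mu(O_{n,j})} \\
&\quad + \frac{\left| \mu(\Omega_j)\,\mu(A_\theta^{-1}(i) \cap \Omega_j) - \mu(\Omega_j)\,\mu(A_\theta^{-1}(i) \cap O_{n,j}) \right|}{\mu(\Omega_j)\,\mu(O_{n,j})} \\
&\le \frac{\mu(A_\theta^{-1}(i) \cap \Omega_j) \left| \mu(O_{n,j}) - \mu(\Omega_j) \right|}{a} + \frac{\left| \mu(A_\theta^{-1}(i) \cap \Omega_j) - \mu(A_\theta^{-1}(i) \cap O_{n,j}) \right| \mu(\Omega_j)}{a} \\
&< \frac{\mu(A_\theta^{-1}(i) \cap \Omega_j)\, \varepsilon a}{2\left( \mu(A_\theta^{-1}(i) \cap \Omega_j) + 1 \right) a} + \frac{\varepsilon a\, \mu(\Omega_j)}{2 \mu(\Omega_j)\, a} \\
&< \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
\end{aligned}
\tag{2.14}
\]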


This convergence occurs almost everywhere (a.e.), since it cannot be shown to occur over sets of $\mu$-measure zero. Because $\mu$ is a probability measure, this is equivalent to almost sure (a.s.) convergence, that is, convergence with probability 1. Recall that $P_{k-1|k}$ is the dependent class conditional probability with regard to the classification system. Let

\[ Q^{(n)}(\theta) = \left( P^{(n)}_{2|1}(\theta), P^{(n)}_{3|1}(\theta), \dots, P^{(n)}_{k|2}(\theta), \dots, P^{(n)}_{1|k}(\theta), \dots, P^{(n)}_{k-2|k}(\theta) \right). \tag{2.15} \]

Then the set of $(k^2 - k - 1)$-vectors over all $\Theta$,

\[ \left\{ Q^{(n)}(\theta) : \theta \in \Theta \right\}, \tag{2.16} \]

defines the $n$th ROC estimate of the ROC manifold. We assume it is a proper ROC manifold. Thus for each $n \in \mathbb{N}$ there exists a continuous real-valued function in $k^2 - k - 1$ variables. Let

\[ g_n\big(Q^{(n)}(\theta)\big) = P^{(n)}_{k-1|k}(\theta) \]

be such a function. Let

\[ Q(\theta) = \left( P_{2|1}(\theta), P_{3|1}(\theta), \dots, P_{k|2}(\theta), \dots, P_{1|k}(\theta), \dots, P_{k-2|k}(\theta) \right) \tag{2.17} \]

for each $\theta \in \Theta$, and set

\[ g\big(Q(\theta)\big) = P_{k-1|k}(\theta). \]

It is clear from Theorem 1 that:

1. $g_n$ is continuous on $\Theta$;

2. $g_n(Q^{(n)}(\theta)) \le 1$ for all $\theta \in \Theta$; and

3. $g_n(Q^{(n)}(\theta)) \to g(Q(\theta))$ a.e. for fixed $\theta$.

Then for $\varepsilon > 0$ given, let $B(\theta; \varepsilon)$ be an open $\varepsilon$-ball in $\Theta$. Thus, by the Dominated Convergence Theorem, we have that

\[ \lim_{n \to \infty} \int_{B(\theta; \varepsilon)} (g_n - g)\, d\zeta = 0, \tag{2.18} \]

so that $\lim g_n = g$ a.e. on $B(\theta; \varepsilon)$. This convergence is uniform a.e. over compact subsets of $\Theta$. $\diamond$

III. A Category Theory Description of Fusion

3.1  Probabilistic  Construction  of  the  Event-Label  Model 

Let $\mathcal{C}$ be a complex of conditions [28] for a repeatable experiment, and let $\Omega$ be a set of outcomes of this experiment, with $T \subset \mathbb{R}$ being a bounded interval of time. Interval $T$ sorts $\Omega$ such that we call $E \subset \Omega \times T$ an event-state. An event-state is then comprised of event-state elements, $e = (\omega, t) \in E$, where $\omega \in \Omega$ and $t \in T$. Thus $e$ denotes a state $\omega$ at an instant of time $t$. Let $\Omega \times T$ be the set of all event-states for an event over time interval $T$. Let $\mathcal{E}$ be a $\sigma$-field on $\Omega \times T$, and $\mu$ be a probability measure defined on the measurable space $(\Omega \times T, \mathcal{E})$. Then the triple $(\Omega \times T, \mathcal{E}, \mu)$ forms a probability space [5].

The  design  of  a  classification  system  involves  the  ability  to  detect  (or  sense)  the  oc¬ 
currence  of  an  event  in  Cl,  and  process  the  event  into  a  label  of  set  L.  For  example,  design 
a  system  that  detects  airborne  objects  and  classifies  them  friendly  or  unfriendly.  To  do 
this  a  classification  system  relies  on  several  mappings,  which  are  composed,  to  provide 
the user an answer (from the event, to the label). Since $\mathcal{E}$ is a $\sigma$-field on $\Omega \times T$, let $E \in \mathcal{E}$ be any member of $\mathcal{E}$. Then a sensor, $s$, is defined as a mapping from $E$ into a (raw) data set $D$. We denote this with the diagram

\[ E \xrightarrow{\ s\ } D \]

so $s(e) = d \in D$ for all $e \in E$. The sensor is defined to produce a specific data type, so the codomain of $s$ is $\operatorname{cod}(s) = D$, where $D$ is the set describing the data output of mapping $s$. A processor, $p$, of this system must have domain $\operatorname{dom}(p) = D$, and maps to a codomain of features, $F$ (a refined data set), $\operatorname{cod}(p) = F$. This is denoted by the diagram

\[ D \xrightarrow{\ p\ } F. \]

Further, a classifier, $c$, of this system is a mapping such that $\operatorname{dom}(c) = F$ and $\operatorname{cod}(c) = L$, where $L$ is a set of labels the user of the system finds useful. This is denoted by the diagram

\[ F \xrightarrow{\ c\ } L. \]

Therefore, we can denote the entire classification system, which is diagrammed as

\[ E \xrightarrow{\ s\ } D \xrightarrow{\ p\ } F \xrightarrow{\ c\ } L, \]

as $A$, the classification system over an event-state $E$, where $A$ is the composition of mappings

\[ A = c \circ p \circ s. \]

Thus, $A$ is an $L$-valued random variable which maps members $E \in \mathcal{E}$ into the label set $L$ and is diagrammed by

\[ E \xrightarrow{\ A\ } L. \]
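
The composition $A = c \circ p \circ s$ can be written directly as function composition. The sketch below is a toy under stated assumptions: the sensor, processor, and classifier bodies are invented for illustration (loosely following the friendly/unfriendly example above), and only the composition pattern reflects the text.

```python
def compose_system(sensor, processor, classifier):
    """Build A = c o p o s as a single callable on event-state elements."""
    def system(event_state):
        data = sensor(event_state)      # s : E -> D
        feature = processor(data)       # p : D -> F
        return classifier(feature)      # c : F -> L
    return system


# Hypothetical components for a toy airborne-object example.
def sensor(event_state):
    omega, t = event_state              # e = (omega, t)
    return (omega, omega + 0.1 * t)     # two raw readings

def processor(data):
    return sum(data) / len(data)        # a single scalar feature

def classifier(feature):
    return "unfriendly" if feature > 1.0 else "friendly"


A = compose_system(sensor, processor, classifier)
```

Calling `A((2.0, 0.0))` runs the event-state element through all three mappings and returns a label from the label set.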

Consider the simple model of a multi-sensor system using two sensors in Figure 3.1. The sets $E_i$, for $i \in \{1, 2\}$, are sets of event-states. The label set $L_i$ can be as simple as the two-class set {target, non-target} or could have a more complex structure to it, such as the types of targets and non-targets, paired with a ranking of measure, for example [56], in order to define the battlefield more clearly for the warfighter.

\[ E_1 \xrightarrow{\ s_1\ } D_1 \xrightarrow{\ p_1\ } F_1 \xrightarrow{\ c_1\ } L_1 \]
\[ E_2 \xrightarrow{\ s_2\ } D_2 \xrightarrow{\ p_2\ } F_2 \xrightarrow{\ c_2\ } L_2 \]

Figure 3.1: Simple Model of a Dual-Sensor System.

Now the diagram in Figure 3.1 represents a pair of classification systems having two sensors, two processors, and two classifiers, but can easily be extended to any finite number. Now consider two sensors not necessarily co-located. Hence they may sense different event-state sets. Figure 3.1 models two sensors with differing fields of view. Performing fusion along any node or edge in this graph could possibly result in an elevated level of fusion [15], that of situation refinement or threat refinement, since we are not fusing common information about a particular event or events, but we may be fusing situations.

There  are  at  least  two  other  possible  scenarios  that  Figure  3.1  could  depict.  The  sen¬ 
sors  can  overlap  in  their  field  of  view,  either  partially  or  fully,  in  which  case  fusing  the 
information  regarding  event-states  within  the  intersection  may  be  useful.  Thus,  a  fusion 
process  may  be  used  to  increase  the  reliability  and  accuracy  of  the  classification  system, 
above  that  which  is  possessed  by  either  of  the  sensors  on  its  own.  Let  E  represent  that 
event-state set that is common to both sensors, that is, $E = E_1 \cap E_2$. Hence, there are
two  fundamental  challenges  regarding  fusion.  The  first  is  how  to  fuse  information  from 
multiple  sources  regarding  common  event-states  (or  target- states,  if  preferred)  for  the  pur¬ 
pose  of  knowing  the  event-state  (presumably  for  the  purposes  of  tracking,  identifying,  and 
estimating  future  event- states).  This  is  commonly  referred  to  as  Level  1  fusion  (or  Level 
0  fusion)  Object  Assessment.  The  second  and  much  more  challenging  problem  is  to  fuse 
information  from  multiple  sources  regarding  event-states  not  common  to  all  sensors,  for 
the  purpose  of  knowing  the  state  of  a  situation  (the  situation-state),  such  as  an  enemy  situ¬ 
ation  or  threat  assessment.  These  are  the  higher  Levels  2  and  3,  Situation  Assessment  and 
Impact  Assessment.  We  distinguish  between  the  two  types  of  fusion  scenarios  discussed 
by  calling  them  event-state  fusion  and  situation-state  fusion  respectively.  Therefore, 
Figure 3.2 represents an Event-State-to-Label model of a dual sensor process. The only restriction necessary for the usefulness of this model is that a common field of view, $E$, be used. Consequently, $D_1$ and $D_2$ could actually be the same data set under the model, while $s_1$ and $s_2$ could be different sensors.

Figure 3.2: Two Classification Systems with Overlapping Fields of View.

We will refer to a finite number of families of classification systems, such as the two in Figure 3.2, whose fusion we wish to explore, as a fixed classification category. For $\mathcal{E}$ considered as a category of sets, and a fixed label set $L$, we note that $L^{\mathcal{E}}$ is the functor category of all such classification systems, so that our fixed classification category is a subcategory of $L^{\mathcal{E}}$. Each classification system or set of sample functions comprises a fixed branch of $L^{\mathcal{E}}$ (i.e., a functor or a family of functors). Equally true is the fact that if we want to compete classification systems, we must test them over the same sample space as well. Therefore, we choose the functor category $L^{E}$, with a fixed $L$ and a fixed $E$, to compete the classification systems over. Our convergence theorems allow us to treat $E$ as if it were the sample population, with the caveat that our test then is only as good as it is representative of the operational circumstances of the real-world population.

It  is  also  important  to  note  that  when  we  want  to  fuse  classification  systems  (or  fami¬ 
lies  of  classification  systems  heretofore  denoted  as  sample  functions),  we  must  be  fusing 
systems  which  are  originally  yielding  values  from  the  same  label  set  L,  and  not  just  the 
same  set  up  to  isomorphism.  We  will  later  show  that  there  are  two  kinds  of  fusion  with 
regard  to  these  label  sets,  but  for  right  now,  we  consider  fusing  only  those  branches  which 
produce  values  in  the  same  exact  set.  Additional  considerations  and  techniques  must  be 
used  to  fuse  across  different  label  sets. 

3.2  Construction  of  a  family  of  classification  systems 

3.2.1 Single Parameter. Now suppose we have a parameter $\theta \in \Theta$, which is possibly multidimensional. Then it is common that there is a family, $\{c_\theta : \theta \in \Theta\}$, of classifiers so that for each $\theta \in \Theta$, each composition,

\[ c_\theta \circ p \circ s \]

describes an event-state model on fixed $E \in \mathcal{E}$, and fixed sets $D$, $F$, and $L$. The corresponding family

\[ \mathbb{A} = \{A_\theta : \theta \in \Theta\}, \]

where $A_\theta = c_\theta \circ p \circ s$, is a family of classification systems. Thus, $\Theta$ acts as an indexing set for defining $\mathbb{A}$, which also could be thought of as a collection of sample functions or sample sequences depending on whether or not the parameter set is countable.
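
A family indexed by a single parameter can be sketched as a dictionary of systems, one per value of $\theta$, from which empirical ROC points follow. The threshold rule $c_\theta$, the identity sensor and processor defaults, and the function names below are assumptions made for illustration only.

```python
def make_family(thresholds, sensor=lambda e: e, processor=lambda d: d):
    """Return {theta: A_theta} with A_theta = c_theta o p o s.

    Assumed classifier: c_theta(f) = 1 if f > theta, else 2.
    """
    def system_for(theta):
        def system(event_state):
            feature = processor(sensor(event_state))
            return 1 if feature > theta else 2
        return system
    return {theta: system_for(theta) for theta in thresholds}


def roc_points(family, class1_events, class2_events):
    """Empirical (P_{2|1}, P_{1|2}) for every member of the family."""
    points = {}
    for theta, system in family.items():
        p21 = sum(system(e) == 2 for e in class1_events) / len(class1_events)
        p12 = sum(system(e) == 1 for e in class2_events) / len(class2_events)
        points[theta] = (p21, p12)
    return points


family = make_family([0.75, 2.25])
points = roc_points(family, [1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 1.5, 2.5])
```

Each value of $\theta$ contributes one point, so sweeping $\theta$ over the index set traces out the estimated ROC curve of the family.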

3.2.2 Multiple Parameter. One can extend the ideas in Section 3.2.1 to include other index sets $\Gamma$ and $\Delta$, so that the composition

\[ c_\theta \circ p_\delta \circ s_\gamma, \]

where $\theta \in \Theta$, $\delta \in \Delta$, $\gamma \in \Gamma$, is a classification system, $A_{(\theta, \delta, \gamma)}$. In this case, we must look at the triple $(\theta, \delta, \gamma)$ as the parameters for the ROC manifold. If we have a two-class label set, then this presents us with the case of degeneracies. For example, suppose we calculate the optimal point on the ROC 1-manifold. Then we have three parameters representing each point on the curve, so that there may be multiple triples which optimize, none better than the others. This fact alone may make it difficult to calculate an optimal triple, since no inverse function mapping ROC points on the curve to the product space $\Theta \times \Delta \times \Gamma$ exists. To eliminate degeneracies, given $k$ classes, we require $k^2 - k - 1 = m - 1$ parameters. Any more than this yields such degeneracies, while any fewer results in either a smaller dimensional ROC manifold, or a set of ROCs which is not a manifold and possibly a suboptimal choice of operating parameters (suboptimal with respect to a ROC $(m - 1)$-manifold).
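
The degeneracy described above is easy to exhibit on a toy two-class problem. In the sketch below, the sensor gain $s_\gamma$, processor offset $p_\delta$, and threshold classifier $c_\theta$ are hypothetical choices made only so that two distinct triples produce the same classification system, and hence the same ROC point.

```python
def make_system(theta, delta, gamma):
    """A_(theta, delta, gamma) = c_theta o p_delta o s_gamma on scalar events."""
    def system(event):
        data = gamma * event            # s_gamma: sensor gain (assumed form)
        feature = data + delta          # p_delta: processor offset (assumed form)
        return 1 if feature > theta else 2   # c_theta: threshold classifier
    return system


def roc_point(system, class1_events, class2_events):
    """Empirical (P_{2|1}, P_{1|2}) of one classification system."""
    p21 = sum(system(e) == 2 for e in class1_events) / len(class1_events)
    p12 = sum(system(e) == 1 for e in class2_events) / len(class2_events)
    return (p21, p12)


class1 = [1.0, 2.5, 3.0, 4.0]
class2 = [0.0, 0.5, 1.5, 2.5]
point_a = roc_point(make_system(3.0, 1.0, 1.0), class1, class2)
point_b = roc_point(make_system(6.0, 2.0, 2.0), class1, class2)
```

Both triples reduce to the same decision rule (label 1 when the event exceeds 2), so `point_a` and `point_b` coincide and the ROC point cannot be mapped back to a unique triple.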

3.3  Defining  Fusion  Rules  from  the  Event-Label  Model 

At this point we begin to consider categories generated by the model's sets of data. Let $\mathcal{D} = (D, \mathrm{Id}_D, \mathrm{Id}_D, \circ)$ be the discrete category generated by data set $D$. We use these categories to define fusion rules of classification systems.

Definition 26 (Fusion Rule of n Fixed Branches of Families of Classification Systems).

Let $\mathcal{C}_n$ be a fixed classification category with $n$ branches. For each $i = 1, \dots, n$, let $\mathcal{O}_i \in \mathbf{CAT}$ be a small category of data corresponding to the $i$th branch's source of data to be fused (this could be raw data, features, or labels). Then the product

\[ \pi(n) = \prod_{i=1}^{n} \mathcal{O}_i \]

is a product category. For any particular category of data, $\mathcal{O}_0$, the exponential, $\mathcal{O}_0^{\pi(n)}$, is a category of fusion rules, each rule of which maps the products of data objects $\mathrm{Ob}(\pi(n))$ to a data object in $\mathrm{Ob}(\mathcal{O}_0)$, and maps data arrows in $\mathrm{Ar}(\pi(n))$ to arrows in $\mathrm{Ar}(\mathcal{O}_0)$. These fusion rules are functors, which make up the objects of the category. The arrows of the functor category are all the natural transformations between them. We designate $FR_{\mathcal{C}_n}(\mathcal{O}_0)$ to be this functor category of fusion rules.

If the $\mathcal{O}_i$ are categories generated from sensor sources (i.e., outputs), then we call $\mathcal{O}_0^{\pi(n)}$ a category of data-fusion rules and use the symbols $\mathcal{D}_0^{\pi(n)_1}$. The fusion rule branch would then be diagrammed like this:

\[ E \xrightarrow{\ \langle s_1, s_2, \dots, s_n \rangle\ } \pi(n)_1 \xrightarrow{\ r\ } \mathcal{D}_0 \xrightarrow{\ p_0\ } F_0 \xrightarrow{\ c_0\ } L, \tag{3.1} \]

where $\mathcal{D}_0$ is the receiving category, $r$ is the fusion rule, and $\langle s_1, s_2, \dots, s_n \rangle$ is the unique arrow generated by the product $\pi(n)_1$. If the categories are generated by processor sources, then call $\mathcal{O}_0^{\pi(n)}$ a category of feature-fusion rules and use the symbols $\mathcal{F}_0^{\pi(n)_2}$. This fusion rule branch would then be diagrammed like this:

\[ E \xrightarrow{\ \langle s_1, s_2, \dots, s_n \rangle\ } \pi(n)_1 \xrightarrow{\ \langle p_1, p_2, \dots, p_n \rangle\ } \pi(n)_2 \xrightarrow{\ r\ } \mathcal{F}_0 \xrightarrow{\ c_0\ } L, \tag{3.2} \]

where $\pi(n)_1$ is the first product of data categories, $\pi(n)_2$ is the second product of feature categories, $r$ is again the fusion rule, and $\langle p_1, p_2, \dots, p_n \rangle$ is the unique arrow generated by the product $\pi(n)_2$. Finally, if they have classifiers as sources, then call them label-fusion rules (or, alternatively, decision-fusion rules) and use the symbols $\mathcal{L}_0^{\pi(n)_3}$. This fusion rule branch would be diagrammed like this:

\[ E \xrightarrow{\ \langle s_1, \dots, s_n \rangle\ } \pi(n)_1 \xrightarrow{\ \langle p_1, \dots, p_n \rangle\ } \pi(n)_2 \xrightarrow{\ \langle c_1, \dots, c_n \rangle\ } \pi(n)_3 \xrightarrow{\ r_\theta\ } L, \tag{3.3} \]

where $r_\theta$ is a fusion rule for each parameter (in order to generate an appropriate family of classification systems), and $\langle c_1, c_2, \dots, c_n \rangle$ is the unique arrow generated by the product $\pi(n)_3$. (We removed the parameters from the classifiers and replaced them with a single, possibly vector valued, parameter on the fusion rule.)
A  fusion  rule  could  be  a  Boolean  rule,  a  filter,  an  estimator,  or  an  algorithm.  Notice 
that  our  definition  of  fusion  rule  does  not  include  a  qualitative  component;  there  is  no 
necessary  condition  of  “betterness”  for  a  fusion  rule.  The  result  of  applying  a  fusion  rule 
to  an  existing  set  of  fundamental  branches  could  result  in  output  considerably  worse  than 
existed  previously.  This  does  not  affect  the  definition.  First  we  define  fusion  rules  as  the 
key  component  of  the  fusion  process.  Next,  we  pare  down  the  category  to  a  subcategory 
which  does  include  a  qualitative  component,  with  one  suggested  way  of  accomplishing 
this.  We  now  desire  to  show  how  defining  a  fusor  (see  Definition  30)  as  a  fusion  rule  with 
a  constraint  changes  the  Event-State  model  into  an  Event-State  Fusion  model.  Continuing 
to  consider  the  two  families  of  classification  systems  in  Figure  3.2,  it  is  evident  that  a 
fusion  rule  can  be  designed  which  would  apply  to  either  the  data  sets,  the  feature  sets, 
or  the  label  sets  (though  special  care  needs  to  be  taken  with  this  case,  when  the  actual 
labels are not the same). Given a fusion rule $\mathfrak{R}$ for the two data sets as in Figure 3.2, our
model becomes that of Figure 3.3. A new data set, processor, feature set, and classifier
may become necessary as a result of the fusion rule having a different codomain than the
previous systems. The label set may change also, but for now, consider a two-class label
set, that of

\[ L = L_1 = L_2 = \{\text{Target}, \text{Nontarget}\}, \]

where the targets and non-targets are well-defined across classification systems (i.e., each
classification system identifies targets that satisfy the same definition of what a target is).

Figure 3.3: Fusion Rule Applied on Data Categories from Two Fixed Branches.

In a within-fusion scenario (see Definition 34 as opposed to Definition 35), the data sets (or
feature sets) are the same, $D_1 = D_2 = D_3$. This is true in the case that the sensors used
are the same type (that is, they collect the same types of measurements, but from possibly
different locations relative to the overlapping field of view). In the case where the data
sets (or feature sets) are truly different, a composite data set (and/or feature set) which is
different from the first two (possibly even the product of the first two) is created as the
codomain of the fusion rule functor.

Now at this point we may ask: in what way is the process modeled in Figure 3.3
superior to the original processes shown in Figure 3.2 when $L = L_1 = L_2$ (we will deal
with the case $L_1 \neq L_2$ later)? One way of comparing performance in such systems is to
compare the processes' receiver operating characteristic (ROC) curves, which we will do
in Chapter IV.

3.4  Fusion  Rules 

3.4.1 Object-Fusion. There are, of course, multiple descriptions in the literature
of "types" of fusion. There is data-fusion, feature-fusion, and decision-fusion. There is
data in-feature out fusion [8] and many more. We would like to codify what should be
meant by these expressions by introducing, in its most basic form, a vernacular for fusion
which is intuitive, yet has its definition rooted in mathematics. We start by assuming we
have a finite number of objects we wish to fuse together. What does the finite set of fusion
rules look like? How can we describe in an observational way what is going on? Once
the definition of fusion is established, we can move on to labeling types of fusion under
certain model assumptions.

Definition 27 (Object-Fusion Category). Let $\{O_i \mid i \in \{1, \ldots, m\}\}$ be a finite sequence
of non-empty categories (possibly discrete). Then

\[ \prod_{i=1}^{m} O_i \]

defines a product category (see Definitions 24 and 26). Let

\[ \pi(m) = \prod_{i=1}^{m} O_i \]

for fixed $m \in \mathbb{N}$. Then for a fixed category $O$, we have that

\[ FR_{\pi(m)}(O) = O^{\pi(m)} \]

is a functor category. The functor category $FR_{\pi(m)}(O)$ is called a $\pi(m)$-Fusion category
relative to $O$ to denote that the functors are fusing $m$ $O_i$-objects, and as necessary, their
accompanying arrows, into a single object and arrow in $O$. When the relationship of all the
$O_i$ objects can be made clear by simply calling them "objects", then we call $FR_{\pi(m)}(O)$
the Object-Fusion category relative to $O$ (regardless of the value of $m$).
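
To make the object part of Definition 27 concrete, the following minimal Python sketch treats each $O_i$ as a discrete category (a plain set of labels) and a fusion rule as a function from the objects of the product $\pi(m)$ to a receiving set. It is only an illustration under these assumptions: the names `product_objects` and `majority_rule` are hypothetical, arrows are ignored, and majority voting is one arbitrary choice of rule, not one singled out by the definition.

```python
from collections import Counter
from itertools import product

def product_objects(*label_sets):
    """Objects of the product pi(m): all m-tuples, one entry per category O_i."""
    return list(product(*label_sets))

def majority_rule(labels):
    """Hypothetical label-fusion rule: map an m-tuple of labels to one label.

    Ties are broken in favor of the tied label that occurs first in the tuple.
    """
    counts = Counter(labels)
    best = max(counts.values())
    for label in labels:
        if counts[label] == best:
            return label

# Three discrete label categories, all equal to the two-class label set L.
L = ("Target", "Nontarget")
pi_3 = product_objects(L, L, L)                    # objects of pi(3) = L x L x L
fused = {labels: majority_rule(labels) for labels in pi_3}
```

Here `fused` tabulates the rule on every object of $\pi(3)$; no claim is made that this rule performs well, which is consistent with the definition carrying no qualitative component.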

It’s  important  to  note  in  our  definition  of  fusion  rules  we  did  not  put  forward  the  notion 
of  defining  fusion  rules  in  terms  of  performance.  We  will  need  a  second  mathematical 
definition  later  to  narrow  the  category  of  fusion  rules  down  to  a  subcategory  of  fusion 
rules,  which  can  be  ordered  according  to  their  performance  in  some  manner.  First  we’ll 
consider  further  delineating  the  types  of  fusion  rules  within  the  Event-State  model. 

3.4.2 Types of Fusion Rules. We consider digraph $G$, as depicted in Figure 3.4,
consisting of sample functions which are compositions of random variables. $E$ is an event
in the $\sigma$-field, $\mathcal{E}$. The sets $D_1$ and $D_2$ are objects of a finite collection of categories
of data sets, while the sets $F_1$ and $F_2$ are objects of a finite collection of categories of
feature sets. The label sets $L_1$ and $L_2$ are the objects of a finite collection of categories
of label sets (and we still require that $L_1 = L_2$). Figure 3.5 shows the nodes in digraph
$G$ along which fusion rules can be applied.

Figure 3.4: Digraph G.

Figure 3.5: Known Fusion Rule Nodes of Digraph G.

With the use of category theory, we can also
describe that there should theoretically be nodes along the arrows of digraph $G$ for fusion
rules as well, though we have no example at this time of a rule or algorithm that does this
without using the pointwise outputs of the arrows. Figure 3.6 shows all available fusion
rule nodes applicable (at least theoretically) to the event-state decision model. This leads
to a theorem regarding the types of fusion available under the model.

Figure 3.6: Theoretical Fusion Rule Nodes of Digraph G.

Theorem 3 (Six Categories of Object-Fusion under digraph G). Let $G$ be a digraph
with an initial vertex and $n$ branches with $k$ vertices to each branch, so that there are
$nk - (n - 1) = n(k - 1) + 1$ total vertices and $n(k - 1)$ edges. Then there exist $2(k - 1)$
categories of Object-Fusion that can be performed on any event-state decision model that
$G$ represents.

Proof: Excluding the event $E$, there are an equal number of edges and vertices to each
branch. The initial vertex represents the event set while the composition of arrows (edges)
along each branch represents the classification system. Fusion rules are objects within
functor categories, so that if we label the non-initial vertices matrix style with rows
representing branches and $k$ columns representing the vertices:

\[
\begin{pmatrix}
v_{11} & v_{12} & \cdots & v_{1k} \\
v_{21} & v_{22} & \cdots & \vdots \\
v_{n1} & \cdots & \cdots & v_{nk}
\end{pmatrix}
\tag{3.4}
\]

The components of column $j$ can be considered as being $n$ categories from a finite
subcategory of the category CAT. Suppose that in column $j$ we have categories
$O_i$, $i = 1, 2, \ldots, n$. Then

\[ \pi(n)_j = \prod_{i=1}^{n} O_i \]

is a product category for the $j$th column. Let $O$ be any category. Then the functor
category $FR_{\pi(n)_j}(O) = O^{\pi(n)_j}$ is the $\pi(n)_j$-Fusion category relative to $O$. Furthermore, in
addition to labeling the non-initial vertices as matrix components, we can create a matrix
from the edges in the same manner, and without loss of generality, the result is $k$ more
fusion categories, so that the total number of fusion categories is $2(k - 1)$. ❖

When the number of vertices per branch is $k = 4$, as in digraph $G$ (see Figure 3.4), then
we have six ($2 \cdot (4 - 1) = 6$) categories of Object-Fusion. Adopting the labeling scheme
used by our model, we can label each category's "objects" as Sensor-, Data-, Processor-,
Feature-, Classifier-, or Label- (or Decision-) Fusion.
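
The counts in Theorem 3 are simple arithmetic and can be checked with a short Python sketch. The function name `digraph_counts` and the list `FUSION_NAMES` are illustrative assumptions; the ordering of the six names follows the alternation of edges and vertices along one branch of the model, from the event set to the label set.

```python
def digraph_counts(n, k):
    """Return (vertices, edges, fusion_categories) for a digraph with n branches
    of k vertices each, where all branches share a single initial vertex."""
    vertices = n * k - (n - 1)        # equals n * (k - 1) + 1
    edges = n * (k - 1)
    fusion_categories = 2 * (k - 1)   # the count stated in Theorem 3
    return vertices, edges, fusion_categories

# Edge and vertex names met along one branch when k = 4, in order.
FUSION_NAMES = ["Sensor", "Data", "Processor", "Feature", "Classifier", "Label"]

vertices, edges, categories = digraph_counts(n=2, k=4)
```

For the two-branch digraph of the model this gives seven vertices, six edges, and six fusion categories, one per entry of `FUSION_NAMES`.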

3.4.3 Comparison of Dasarathy's paradigm with Fusion Categories. The chart in
Table 3.1 shows the relationship between these categories and Dasarathy's breakdown of
the types of fusion.

Dasarathy's I/O taxonomy      Category Theory Approach
------------------------      ------------------------
No taxonomy                   Sensor-Fusion
Data In-Data Out              Data-Fusion
Data In-Feature Out           Processor-Fusion or Data-Fusion
Feature In-Feature Out        Feature-Fusion
Feature In-Decision Out       Classifier-Fusion or Feature-Fusion
Decision In-Decision Out      Label-Fusion (also called Decision-Fusion)

Table 3.1: Dasarathy's I/O Fusion categorization from [15].

3.5 Operating Characteristic Functionals

Definition 28 (Similar Families of Classification Systems). Two families of classification
systems $\mathbb{A}$ and $\mathbb{B}$ are called similar if and only if they operate on the same $\sigma$-field and
their output is the same well-defined label set.

Suppose we have a fixed classification category $L^E$, and let $A$ be an object in this
category. Then for $L$ consisting of $k$ labels, there exists a vector in ($n = k^2 - k$)-ROC space
described by an $n$-vector $v_A$, where

\[ v_A = (p_{2|1}(A), \ldots, p_{k|1}(A), \ldots, p_{k-1|k}(A)). \]

The proof is self-evident since $E$ is a sample space. We call this vector the operating
characteristic vector, and we let

\[ V = \{ v_A \mid A \in \mathrm{Ob}(L^E) \} \tag{3.5} \]

and

\[ \mathcal{V} = OC_{L^E} = (\mathcal{P}(V), \mathrm{Ar}(V), \mathrm{Id}(V), \circ), \tag{3.6} \]

where $\mathcal{P}(V)$ is the power set of $V$. The category $OC_{L^E}$ is the category of operating
characteristic families with undetermined non-identity arrows (we will determine them
presently). Now, consider the category

\[ \mathcal{C} = (\mathcal{P}(\mathrm{Ob}(L^E)), \mathrm{Id}(L^E), \mathrm{Id}(L^E), \circ) \]

whose objects are sets of classification systems. Then $\mathbb{A} \in \mathrm{Ob}(\mathcal{C})$ for each family of
classification systems $\mathbb{A}$. Let

\[ \mathfrak{F} : \mathcal{C} \to \mathcal{V} \tag{3.7} \]

be an operating characteristic functor, which maps power sets of classification systems to
the set of operating characteristics associated with them. Let

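\[ \mathfrak{L} : \mathcal{V} \to P \tag{3.8} \]

be a functor where $P$ is a poset, thought of as a category induced by a partial order, $\geq$, of its
elements. Then $\mathfrak{L}$ is a functor taking objects consisting of sets of operating characteristics
into a value of $P$. We do not need to define the rule at this point. Let $\mathbb{A}_0, \mathbb{A}_1 \in \mathcal{C}$, such
that

\[ \mathfrak{F}(\mathbb{A}_0) = f_{\mathbb{A}_0} \]

and

\[ \mathfrak{F}(\mathbb{A}_1) = f_{\mathbb{A}_1} \]

where the outputs are families of operating characteristics. Then the diagram

\[
\begin{array}{ccc}
f_{\mathbb{A}_0} & \xrightarrow{\ \mathfrak{L}\ } & \mathfrak{L}(f_{\mathbb{A}_0}) = p_0 \\
g \big\downarrow & & \big\downarrow \geq \\
f_{\mathbb{A}_1} & \xrightarrow{\ \mathfrak{L}\ } & \mathfrak{L}(f_{\mathbb{A}_1}) = p_1
\end{array}
\]

where $p_0, p_1 \in P$, commutes for some unique (up to isomorphism) $g$. This $g$ is an induced
partial order on $\mathcal{V}$. Thus, for every pair of families of classification systems, $\mathbb{A}_0, \mathbb{A}_1 \in \mathcal{C}$,
we have that the rectangle

\[
\begin{array}{ccccc}
\mathbb{A}_0 & \xrightarrow{\ \mathfrak{F}\ } & \mathfrak{F}(\mathbb{A}_0) = f_{\mathbb{A}_0} & \xrightarrow{\ \mathfrak{L}\ } & \mathfrak{L}(f_{\mathbb{A}_0}) = p_0 \\
\succeq \big\downarrow & & & & \big\downarrow \geq \\
\mathbb{A}_1 & \xrightarrow{\ \mathfrak{F}\ } & \mathfrak{F}(\mathbb{A}_1) = f_{\mathbb{A}_1} & \xrightarrow{\ \mathfrak{L}\ } & \mathfrak{L}(f_{\mathbb{A}_1}) = p_1
\end{array}
\tag{3.9}
\]

commutes when we impose the criterion $\mathbb{A}_0 \succeq \mathbb{A}_1$ iff $(\mathfrak{L} \circ \mathfrak{F})(\mathbb{A}_0) \geq (\mathfrak{L} \circ \mathfrak{F})(\mathbb{A}_1)$, so that
the functor $\mathfrak{L} \circ \mathfrak{F}$ is a natural transformation. It is precisely the arrows like $g$, which make
such rectangles commute, that belong in the category $\mathcal{V}$. It is also the arrows induced
from the partial order which provide unique maps from one classification family to
another, which will allow us to define the fusion process in Chapter IV.

IV. An Optimization for Competing Fusion Rules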

4.1  Bayes  Optimal  Threshold  (BOT)  in  a  family  of  classification  systems 

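4.1.1 Two-class BOT. Let $(E, \mathcal{E}, \mu)$ be a probability space and $\mathbb{A}$ be a family of
sample functions (classification systems) with parameter space $\Theta$. Let

\[ \{E_1, E_2 : E_1, E_2 \in \mathcal{E}\} \]

be a partition of $E$, and $L = \{\ell_1, \ell_2\}$. It is well-known and accepted that the threshold for
which the probability of a misclassification (or Bayes error) is minimized is considered
best and denoted the Bayes optimal threshold (BOT). That is, if $A_{\theta^*} \in \mathbb{A}$ with $\theta^* \in \Theta$
minimizes the quantity

\begin{align}
\mu\big((A_\theta^{-1}(\ell_1) \cap E_2) \cup (A_\theta^{-1}(\ell_2) \cap E_1)\big)
  &= \mu(A_\theta^{-1}(\ell_1) \cap E_2) + \mu(A_\theta^{-1}(\ell_2) \cap E_1) \notag \\
  &= p_{1|2}(A_\theta)\,\mu(E_2) + p_{2|1}(A_\theta)\,\mu(E_1), \tag{4.1}
\end{align}

where $A_\theta^{-1}(\ell)$ denotes the pre-image of the label $\ell$, and $\mu(E_1)$ and $\mu(E_2)$ are the prior
probabilities of class 1 and class 2, respectively, then $\theta^*$ is the BOT for the family of
classification systems $\mathbb{A}$.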
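
As a numerical illustration of Equation 4.1, the sketch below assumes a hypothetical one-dimensional feature with unit-variance Gaussian class-conditional densities (class $E_1$ centered at 1, class $E_2$ centered at 0) and a classifier $A_\theta$ that declares $\ell_1$ whenever the feature exceeds $\theta$. None of these modeling choices come from the text; they are made only so that the grid minimizer can be compared with the likelihood-ratio threshold $\theta^* = 1/2 + \ln(\mu(E_2)/\mu(E_1))$ that holds for this particular model.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_rates(theta):
    """Return (p_{1|2}, p_{2|1}) for the rule 'declare l1 when feature > theta'.

    Assumed model: feature ~ N(1, 1) under E1 and ~ N(0, 1) under E2.
    """
    p12 = 1.0 - Phi(theta)       # declare l1 although E2 occurred
    p21 = Phi(theta - 1.0)       # declare l2 although E1 occurred
    return p12, p21

def bayes_error(theta, prior1, prior2):
    """Right-hand side of Equation 4.1 for the assumed model."""
    p12, p21 = error_rates(theta)
    return p12 * prior2 + p21 * prior1

def bayes_optimal_threshold(prior1, prior2, lo=-4.0, hi=6.0, steps=10000):
    """Grid search for the threshold that minimizes the Bayes error."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda th: bayes_error(th, prior1, prior2))

theta_star = bayes_optimal_threshold(prior1=0.3, prior2=0.7)
closed_form = 0.5 + math.log(0.7 / 0.3)   # likelihood-ratio threshold for this model
```

The grid search is deliberately naive; it stands in for whatever search over $\Theta$ is appropriate for a given family of classification systems.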

4.1.2 N-class BOT. Now let us keep the assumptions of the previous section with
the exception that we now have $k$ classes to consider, and

\[ \{E_1, E_2, \ldots, E_k : E_i \in \mathcal{E} \ \ \forall\, i = 1, 2, \ldots, k\} \]

is a new partition of $E$ into $k$ classes, with $L = \{\ell_1, \ell_2, \ldots, \ell_k\}$ a label set corresponding
to the partition of classes. Then the corresponding Bayes Optimal Threshold, $\theta^* \in \Theta$,
where $\Theta$ is now $k - 1$ dimensional, would be the parameter which minimizes

\[ B_{err} = \sum_{i=1}^{k} \sum_{j=1}^{k} (1 - \delta_{i,j})\, p_{i|j}(A_\theta)\, \mu(E_j), \tag{4.2} \]

where

\[
\delta_{i,j} =
\begin{cases}
0 & \text{if } i \neq j \\
1 & \text{if } i = j.
\end{cases}
\]
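
Equation 4.2 is a weighted sum of the off-diagonal entries of a confusion matrix. A minimal Python sketch, assuming the conditional probabilities are supplied as a nested list `p` with `p[i][j]` standing for $p_{i+1|j+1}$ (the function name and the example numbers are hypothetical):

```python
def bayes_error_k(p, priors):
    """Evaluate Equation 4.2.

    p[i][j] is the probability of declaring label i when class j is true,
    so every column of p sums to one; priors[j] is mu(E_j).
    """
    k = len(priors)
    return sum(p[i][j] * priors[j]
               for i in range(k) for j in range(k) if i != j)

# Hypothetical three-class example.
p = [[0.8, 0.1, 0.2],
     [0.1, 0.7, 0.1],
     [0.1, 0.2, 0.7]]
priors = [0.5, 0.3, 0.2]
err = bayes_error_k(p, priors)
```

Because each column of `p` sums to one, the same value is obtained as one minus the prior-weighted sum of the diagonal entries, which gives a quick consistency check.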


4.2  An  Optimization  over  ROC  m-manifolds  for  competing  fusion  rules 

4.2.1 ROC m-manifold optimization. The method used in this section applies and
extends that of [31]. Let $k \in \mathbb{N}$, $k > 1$ be given, with $m + 1 = k^2 - k$. Let

\[ x_{m+1} = f(x_1, x_2, \ldots, x_m) \]

be the equation of the ROC m-manifold. Then define

\[ \Psi(x_1, x_2, \ldots, x_{m+1}) = f(x_1, x_2, \ldots, x_m) - x_{m+1}. \]

Let $\mathfrak{M} = \{(x_1, x_2, \ldots, x_{m+1}) : \Psi(x_1, x_2, \ldots, x_{m+1}) = 0\}$ be the ROC m-manifold.
Assume $R(0) = (0, 0, \ldots, 0, 0)$. Then there is $t_f \in [0, 1]$ such that $R(t_f) \in \mathfrak{M}$, with $t_f$
dependent upon the particular $R$. We assume all first-order partial derivatives exist and
are continuous for $\Psi$. For each $t \in [0, 1]$ let $R(t) = (X_1(t), X_2(t), \ldots, X_{m+1}(t))$ be a
smooth trajectory that starts at the initial point $(0, 0, \ldots, 0, 0)$ and terminates on the
manifold $\mathfrak{M}$. Choose weights $a_i > 0$ for $i = 1, 2, \ldots, m + 1$ such that $\sum_{i=1}^{m+1} a_i = 1$, and let
$\|\cdot\|_w$ represent the weighted $\ell^1(\mathbb{R}^{m+1})$ norm defined on $V = (v_1, v_2, \ldots, v_{m+1})$ by

\[ \|V\|_w = \sum_{i=1}^{m+1} a_i |v_i|. \tag{4.3} \]

Define the functional $J$

\[ J[R] = \int_0^{t_f} \|\dot{R}(t)\|_w \, dt. \tag{4.4} \]
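
The functional $J$ of Equation 4.4 can be approximated for a piecewise-linear trajectory by summing the weighted norm of each increment. The Python sketch below (the function names and the sample paths are hypothetical) illustrates that, for trajectories that are monotone in every coordinate, $J$ depends only on the terminal point, whereas a detour that backtracks in some coordinate costs more.

```python
def weighted_l1(v, a):
    """Weighted l1 norm of Equation 4.3."""
    return sum(ai * abs(vi) for ai, vi in zip(a, v))

def functional_J(points, a):
    """Approximate J[R] for the polyline through `points`
    (the weighted l1 arclength of the trajectory)."""
    total = 0.0
    for p, q in zip(points, points[1:]):
        total += weighted_l1([qi - pi for pi, qi in zip(p, q)], a)
    return total

a = (0.7, 0.3)                                    # weights summing to one
straight = [(0.0, 0.0), (0.2, 0.4)]               # monotone, direct
staircase = [(0.0, 0.0), (0.2, 0.0), (0.2, 0.4)]  # monotone, axis-aligned
detour = [(0.0, 0.0), (0.5, 0.0), (0.2, 0.4)]     # backtracks in x_1

J_straight = functional_J(straight, a)
J_staircase = functional_J(staircase, a)
J_detour = functional_J(detour, a)
```

Both monotone paths give $0.7 \cdot 0.2 + 0.3 \cdot 0.4 = 0.26$, the weighted norm of the shared terminal point, so minimizing $J$ over such paths amounts to choosing where the trajectory lands on the manifold.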
Theorem 4 (Thorsen-Oxley). Given $k$ classes, and a ROC m-manifold, where $m + 1 =
k^2 - k$ is the number of possible types of errors in classification, and given weights
$a_i = c_i \alpha_i$, a cost times a prior probability (where $\sum_{i=1}^{m+1} a_i = 1$), then the Bayes Optimal
Threshold corresponds to the point, $p$, on the ROC m-manifold, $f$, where

\[ \nabla f(p) = -\frac{1}{a_{m+1}} (a_1, a_2, \ldots, a_m). \tag{4.5} \]
Proof: For ease of notation, define

\[ G(t, R(t), \dot{R}(t)) = \|\dot{R}(t)\|_w \]

and let $Y_i(t) = \dot{X}_i(t)$ for each $i$. Hence we write Equation 4.4 as

\[ J[R] = \int_0^{t_f} G \, dt \tag{4.6} \]

and we will suppress the integrand variables. We would like to minimize $J$, so let us find
$R(t)$ with initial and terminal points as discussed which minimizes the functional.

Let $\alpha \in [-\beta, \beta]$, where $\beta \in \mathbb{R}$, $\beta > 0$, be a family of real parameters. Let

\[ \{ R(t, \alpha) = (X_1(t, \alpha), X_2(t, \alpha), \ldots, X_{m+1}(t, \alpha)) : \alpha \in [-\beta, \beta] \} \tag{4.7} \]

be a family of one-parameter trajectories which contains the optimal curve $R^*(t)$.
Furthermore we assume that at $\alpha = 0$, $R(t, 0) = R^*(t)$. Let $R(t_f, \alpha) \in \mathfrak{M}$. By the Implicit
Function Theorem, there is a function $T_f(\alpha)$ such that $R(T_f(\alpha), \alpha) \in \mathfrak{M}$ for all $\alpha$. Thus
$R(t_f^*, 0) = R^*(t_f^*)$ so that $T_f(0) = t_f^*$. Assume $R^*(t)$ minimizes $J$; then a necessary
optimality condition is that the first variation of

\[ J[R(\cdot, \alpha)] \tag{4.8} \]

be equal to zero at $\alpha = 0$. That is,

\[ \frac{d}{d\alpha} J[R(\cdot, \alpha)] \Big|_{\alpha = 0} = 0. \tag{4.9} \]

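We use the notation

\[ \delta = \frac{d}{d\alpha} \Big|_{\alpha = 0} \]

for brevity. Applying Leibniz's rule to the derivative of Equation 4.8 we get

\[ \delta J[R^*] = G^*\big|_{t = t_f^*}\, \delta T_f + \int_0^{t_f^*} \big( \nabla_X G^* \cdot \delta R + \nabla_Y G^* \cdot \delta \dot{R} \big)\, dt, \tag{4.10} \]

where $G^*$ is a suppressed notation for $G(t, R^*(t), \dot{R}^*(t))$. Now integrating by parts yields

\[ \delta J[R^*] = G^*\big|_{t = t_f^*}\, \delta T_f + \big[ \nabla_Y G^* \cdot \delta R \big]_0^{t_f^*} + \int_0^{t_f^*} \Big( \nabla_X G^* \cdot \delta R - \frac{d}{dt} \nabla_Y G^* \cdot \delta R \Big)\, dt. \tag{4.11} \]

At $\alpha = 0$ we have the necessary optimality condition

\[ \delta J[R^*] = G^*\big|_{t = t_f^*}\, \delta T_f + \big[ \nabla_Y G^* \cdot \delta R \big]_{t = t_f^*} + \int_0^{t_f^*} \Big( \nabla_X G^* \cdot \delta R - \frac{d}{dt} \nabla_Y G^* \cdot \delta R \Big)\, dt = 0. \tag{4.12} \]

Since this must be true over all admissible variations, we have the Euler Equations

\[ \nabla_X G^* - \frac{d}{dt} \nabla_Y G^* = 0 \tag{4.13} \]

for all $t \in [0, t_f^*]$ and a transversality condition

\[ G^*\big|_{t = t_f^*}\, \delta T_f + \big[ \nabla_Y G^* \cdot \delta R \big]_{t = t_f^*} = 0. \tag{4.14} \]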


Solving the Euler Equation 4.13, we have $\nabla_X G^* = 0$, which implies

\[ \frac{d}{dt} \nabla_Y G^* = 0, \tag{4.15} \]

hence,

\[ \frac{d}{dt} \operatorname{sgn}(Y_i^*(t)) = 0 \tag{4.16} \]

for $i = 1, 2, \ldots, m + 1$, where $\operatorname{sgn}(Z)$ returns the value of $-1$, $0$, or $1$, depending on the
sign of the function $Z$. Thus, for each $i = 1, 2, \ldots, m + 1$, we have

\[ \operatorname{sgn}(Y_i^*(t)) = k_i \tag{4.17} \]

for some $k_i \in \{-1, 0, 1\}$. Thus, for all $i$, we have $\Delta X_i^*(t) > 0$
for all $t$ and $\Delta t > 0$ for all $t$, so that $k_i = 1$. It is clear that $\Delta X_i^*(t) = 0$ is not optimal
given the initial and terminal conditions. Thus, we have that

\[ \operatorname{sgn}(Y_1^*(t)) = \operatorname{sgn}(Y_2^*(t)) = \cdots = \operatorname{sgn}(Y_m^*(t)) = \operatorname{sgn}(Y_{m+1}^*(t)) = 1. \]

Now $R(T_f(\alpha), \alpha)$ terminates on $\mathfrak{M}$, so $\Psi(R(T_f(\alpha), \alpha)) = 0$ for all $\alpha$. Let $R^*(t_f^*) =
(x_1^*, x_2^*, \ldots, x_{m+1}^*) \in \mathfrak{M}$. Hence,

\[ X_{m+1}(T_f(\alpha), \alpha) = f\big(X_1(T_f(\alpha), \alpha), \ldots, X_m(T_f(\alpha), \alpha)\big). \tag{4.18} \]

Rearranging terms, rewriting in vector notation, and letting $f^* = f(x_1^*, \ldots, x_m^*)$ we have

\[ \Big( \frac{\partial f^*}{\partial x_1}, \ldots, \frac{\partial f^*}{\partial x_m}, -1 \Big) \cdot H(t_f^*) + \Big( \frac{\partial f^*}{\partial x_1}, \ldots, \frac{\partial f^*}{\partial x_m}, -1 \Big) \cdot \dot{R}^*(t_f^*)\, \delta T_f = 0, \tag{4.21} \]

which can be rewritten

\[ \nabla \Psi^* \cdot H(t_f^*) + \nabla \Psi^* \cdot \dot{R}^*(t_f^*)\, \delta T_f = 0. \tag{4.22} \]

From Equation 4.14 we write

\[ \nabla_Y G^*\big|_{t_f^*} \cdot H(t_f^*) + G^*\big|_{t_f^*}\, \delta T_f = 0. \tag{4.23} \]

Since both Equations 4.22 and 4.23 must be true over all variations and all possible
one-parameter families, we have

\[ \kappa \nabla_Y G^*\big|_{t_f^*} = \nabla \Psi^*\big|_{t_f^*} \tag{4.24} \]

for some $\kappa \in \mathbb{R}$. Hence, for $i = 1, 2, \ldots, m + 1$ we have

\[ \frac{\partial \Psi^*}{\partial x_i} \Big|_{t = t_f^*} = \kappa\, a_i \operatorname{sgn}(Y_i^*(t_f^*)). \tag{4.25} \]

In the case of $i = m + 1$ we have that

\[ \frac{\partial \Psi^*}{\partial x_{m+1}} \Big|_{t = t_f^*} = -1. \tag{4.26} \]

Thus, we have that $\kappa = -\dfrac{1}{a_{m+1}}$. Hence for $i = 1, 2, \ldots, m$ we have that

\[ \frac{\partial f^*}{\partial x_i} \Big|_{t = t_f^*} = -\frac{a_i}{a_{m+1}}. \tag{4.27} \]

This leads to the result that

\[ \nabla \Psi^*\big|_{t_f^*} = -\frac{1}{a_{m+1}} (a_1, a_2, \ldots, a_{m+1}) \tag{4.28} \]

is a normal to the ROC m-manifold $\mathfrak{M}$ at the terminal point of the smooth trajectory
minimizing $J$. This is a global minimum, since we are optimizing a convex
functional [31]. This agrees with the limited approach based on observation taken by
Haspert [18].

The equation of the plane perpendicular to this normal and tangent to the ROC manifold
at the optimal point is

\[ a_1(x_1 - x_1^*) + a_2(x_2 - x_2^*) + \cdots + a_{m+1}(x_{m+1} - x_{m+1}^*) = 0. \tag{4.29} \]

❖

To find $(x_1^*, x_2^*, \ldots, x_{m+1}^*)$, generate the plane $a_1 x_1 + a_2 x_2 + \cdots + a_{m+1} x_{m+1} = 0$,
which passes through the origin, and translate it to the ROC m-manifold towards the
point $(1, 1, \ldots, 1)$ until the plane rests tangent to the ROC manifold at a single point.
This point, $(x_1^*, x_2^*, \ldots, x_{m+1}^*)$, is the terminal point of $R^*$. Now, recall that $m + 1 =
k^2 - k$. We associate the $k$ with the number of classes in a classification problem, so
that there is a label set of interest with cardinality $k$, and a sample space with a partition
of cardinality $k$ associated with these labels. Let $l = 1, 2, \ldots, k$, and for each $l$ let
$r = 1, 2, \ldots, k$, $r \neq l$. Then associate with each $i = 1, 2, \ldots, m + 1$ a unique $(l, r)$ pair.
Designate for each $x_i$ that $x_i = p_{l|r}(A_\theta)$ for some $A_\theta \in \mathbb{A}$, a family of sample functions
(classification systems). Thus, the $x_i$ variables represent the error axes of the $k$-class
classification problem. Similarly, we can designate costs (or losses) for each error by
allowing $a_i = c_{l,r}$ for each $i$. Then the sum,

\[ \sum_{l=1}^{k} \sum_{r=1}^{k} (1 - \delta_{l,r})\, c_{l,r}\, p_{l|r}(A_\theta), \tag{4.30} \]

is the equation for Bayes Risk, or Bayes Error in the case where $c_{l,r} = 1$ for each $(l, r)$
pair. We then have for the specific $x_i^*$, that there exists $\theta^* \in \Theta$, such that

\[ x_i^* = p_{l|r}(A_{\theta^*}), \tag{4.31} \]

so that $\theta^*$ is the Bayes Optimal Threshold (or Bayes Optimal Risk Threshold) for the
$k$-class classification problem. When these correspondences can be made (along with
the appropriate dimensionality of the parameter space $\Theta$ and the ROC manifold $\mathfrak{M}$), it is
clear that our optimal trajectory, $R^*$, terminates on the ROC manifold at the point corresponding
to the Bayes Optimal Threshold (or Bayes Optimal Risk Threshold). Therefore, we have
shown how a ROC manifold can be analyzed to find the point corresponding to the Bayes
Optimal Threshold.
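
When only finitely many operating points are available, the tangent-plane construction above reduces to picking the sampled point with the smallest weighted sum $\sum_i a_i x_i$. The Python sketch below does this for a hypothetical two-class error curve $x_2 = (1 - \sqrt{x_1})^2$ with equal weights; the curve, the grid, and the function name `optimal_operating_point` are assumptions made for illustration only.

```python
import math

def optimal_operating_point(points, a):
    """Return the sampled point minimizing the weighted error sum a . x,
    i.e. the first point touched by the translated plane a . x = const."""
    return min(points, key=lambda x: sum(ai * xi for ai, xi in zip(a, x)))

def error_curve(x1):
    """Hypothetical convex trade-off between the two error axes."""
    return (1.0 - math.sqrt(x1)) ** 2

# Sample the curve on a uniform grid of the first error axis.
samples = [(i / 100.0, error_curve(i / 100.0)) for i in range(101)]
a = (0.5, 0.5)
best = optimal_operating_point(samples, a)

# Slope of the curve at the selected point, for comparison with -a_1 / a_2.
slope_at_best = -(1.0 - math.sqrt(best[0])) / math.sqrt(best[0])
```

For this curve the selected point is $(0.25, 0.25)$, where the slope equals $-a_1/a_2 = -1$, in line with the gradient condition of Theorem 4 for $m = 1$.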

4.2.2 ROC 1-manifold optimization (Optimizing the ROC curve). Here we demonstrate
the optimization of ROC 1-manifolds, referred to in this section as ROC curves. We
demonstrate that the technique shown in the previous section applies to the case of the
two-class problem, with the ROC curves having the axes typical in the literature: a true
positive axis in the vertical direction and a false positive axis in the horizontal direction. We
will only consider ROC curves that are smooth (differentiable) over the entire range, i.e.,
we consider the set

\[ C^1([0, 1], \mathbb{R}) = \{ f : [0, 1] \to \mathbb{R} : f \text{ is differentiable at each } x \in (0, 1) \text{ and its derivative } f' \text{ is continuous at each } x \in [0, 1] \}. \]

Given a diagram describing the family of classification systems $\mathbb{A} = \{A_\theta : \theta \in \Theta\}$, with $\Theta$
a continuous parameter set (assumed to be one dimensional), and $(E, \mathcal{E}, \mu)$ a probability
space of features, there is a set $\tau_{\mathbb{A}} = \{(\theta, p_{1|2}(A_\theta), p_{1|1}(A_\theta)) : \theta \in \Theta\}$ which is called
the ROC trajectory for the classification system family $\mathbb{A}$. The projection of the ROC
trajectory onto the $(p_{1|2}, p_{1|1})$-plane is the set $f_{\mathbb{A}} = \{(p_{1|2}(A_\theta), p_{1|1}(A_\theta)) : \theta \in \Theta\}$ which
is the ROC curve of the classification system family $\mathbb{A}$. Hence, for $h \in [0, 1]$ such that
$h = p_{1|2}(A_\theta)$ for some $\theta \in \Theta$, we have that

\[ p_{1|2}^{-1}(\{h\}) = \{A_\theta\}, \]

that is, the pre-image of $h$ under $p_{1|2}(\cdot)$ is the classification system $A_\theta$, which we assume
has a one-to-one and onto correspondence to $\theta$. Therefore, the BOT of the family of
classification systems $\mathbb{A}$, denoted by $\theta^*$, corresponds to some $h^* = p_{1|2}(A_{\theta^*}) \in [0, 1]$,
which may not be unique, unless the function $p_{1|2}(\cdot)$ is one-to-one. So, there is at least
one such $h^*$; now what can we learn about it? Consider the problem stated as follows:


Let $\alpha, \beta > 0$. Among all smooth curves whose endpoints lie on the point
$(0, 1)$ and the ROC curve given by $y = f(X(t))$, find the curve, defined by
the trajectory $R(t) = (X(t), Y(t))$, for which the functional

\[ J[R] = \int_0^h \|\dot{R}\|_w \, dt = \int_0^h \big[ \alpha |\dot{X}(t)| + \beta |\dot{Y}(t)| \big]\, dt \tag{4.32} \]

has a minimum subject to the constraints:

\[ R(0) = (0, 1), \qquad R(h) = (h, f(h)), \tag{4.33} \]

for some $h \in [0, 1]$ that depends on $R$. We let $X(t) = t$ due to the constraints
and denote $W = \dot{X}(t)$ and $Z = \dot{Y}(t)$, so that $\dot{X}(t) = 1$, and Equation 4.32
becomes

\[ J[R] = \int_0^h \big[ \alpha + \beta |Z(t)| \big]\, dt. \tag{4.34} \]

Observe that $h = p_{1|2}(A_\theta)$, $f(h) = p_{1|1}(A_\theta)$ for some $\theta \in \Theta$, and $\beta =
\mu(E_1) = 1 - \alpha$ with $\alpha = \mu(E_2)$, the prior probability of a class 2 occurrence.


The functional $J$, when minimized, identifies the trajectory with smallest arclength
(measured with respect to the weighted 1-norm). The constraints of Equation 4.33 require
that the curve must begin at $(0, 1)$ and terminate on the ROC curve. The integrand of
Equation 4.34 can be written in a suppressed form

\[ G(t, X(t), Y(t), \dot{X}(t), \dot{Y}(t)) = G(t, X, Y, W, Z), \tag{4.35} \]

so that the partial derivatives are more easily understood. In the case where $X(t) = t$,
then $\dot{X}(t) = 1$ and we have that Equation 4.35 can be further suppressed:

\[ G(t, Y, Z). \tag{4.36} \]

Any $R$ that minimizes $J$, subject to the constraints 4.33, necessarily must be a solution to
Euler's Equation [13]

\[ \frac{\partial}{\partial Y} G(t, Y, Z) - \frac{d}{dt} \frac{\partial}{\partial Z} G(t, Y, Z) = 0 \quad \text{for all } t \in (0, h). \tag{4.37} \]

From Equation 4.34 we have $G(t, Y, Z) = \alpha + \beta |Z|$, so that $\frac{\partial}{\partial Y} G = 0$ and $\frac{\partial}{\partial Z} G =
\beta \operatorname{sgn}(Z)$. Hence, we have that $R$ solves the Euler equation

\[ -\frac{d}{dt}\, \beta \operatorname{sgn}(Z(t)) = 0 \quad \text{for all } t \in (0, h). \tag{4.38} \]

Integrating this equation reveals that $\operatorname{sgn}(Z(t))$ is constant for all $t \in [0, h]$. Since $Y(t) \leq
1$ for all $t \in (0, h)$, and $Y(0) = 1$, from Constraints 4.33, then $\operatorname{sgn}(Z(t))$ must be $0$ or $-1$,
since the trajectory is moving either constantly across to the curve or constantly downward
from the point $(0, 1)$. Now, if $\operatorname{sgn}(Z(t)) = 0$ for all $t$, then $1 = Y(0) = Y(h) = Y(1)$
due to the smoothness of the ROC curve. Substituting this solution into the functional $J$
in Equation 4.32 yields

\[ J[R] = \alpha h = \mu(E_2)\, p_{1|2}(A_\theta), \tag{4.39} \]

with $p_{1|2}(A_\theta) = 1$. Thus, $J[R] = \mu(E_2)$ and the weighted (1-norm) arclength of curve
$R$ is therefore $\mu(E_2)$. On the other hand, if $\operatorname{sgn}(Z(t)) = -1$ for all $t \in (0, h)$, then
$|Z(t)| = -Z(t)$ and substituting this into $J$ directly in Equation 4.32 yields

\begin{align}
J[R] &= \int_0^h \big[ \alpha - \beta Z(t) \big]\, dt \tag{4.40} \\
     &= \alpha h + \beta \big[ Y(0) - Y(h) \big] \notag \\
     &= \alpha h + \big[ 1 - Y(h) \big] \beta \notag \\
     &= p_{1|2}(A_\theta)\, \mu(E_2) + \big( 1 - p_{1|1}(A_\theta) \big)\, \mu(E_1) \notag \\
     &= p_{1|2}(A_\theta)\, \mu(E_2) + p_{2|1}(A_\theta)\, \mu(E_1). \tag{4.41}
\end{align}

Notice that Equation 4.41 is identical to the unminimized Bayes Optimal Threshold
equation. Therefore, $h = h^*$ which minimizes Equation 4.41 corresponds to the BOT, $\theta^*$, of
the family of classification systems, $\mathbb{A}$. The transversality condition [13] of this problem
is

\[ \alpha + \beta |Z(t)| \,\big|_{t = h^*} + \beta \big( f'(t) - Z(t) \big) \operatorname{sgn}(Z(t)) \,\big|_{t = h^*} = 0 \tag{4.42} \]

so that

\[ f'(h^*) = \frac{\alpha}{\beta} \tag{4.43} \]

which is

\[ f'(h^*) = \frac{\mu(E_2)}{\mu(E_1)}. \tag{4.44} \]

So the transversality condition tells us that the BOT of a family of classification systems
corresponds to a point on the ROC curve which has a derivative equal to the ratio of prior
probabilities,

\[ \frac{\mu(E_2)}{\mu(E_1)}. \]

Therefore, if one presumes a ratio of prior probabilities equal to 1, then the point on the
curve corresponding to the BOT will have a tangent to the ROC curve with slope 1. We
could substitute $\alpha = C_{1|2}\, \mu(E_2)$ and $\beta = C_{2|1}\, \mu(E_1)$, where $C_{1|2}$ and $C_{2|1}$ are the costs of
making each error, or we could specify a cost-prior ratio

\[ \frac{C_{1|2}\, \mu(E_2)}{C_{2|1}\, \mu(E_1)}, \]

if we wish to consider costs in addition to the prior probabilities. This gives us an idea
of  what  would  make  a  good  functional  for  determining  which  families  of  classification 
systems  are  more  desirable  than  others.  An  immediate  approach  would  be  to  choose  a 
preferred  prior  ratio  and  construct  a  linear  variety  through  the  optimal  ROC  point  (the 
point  (0, 1)  for  the  typical  two-class  ROC  manifold  classification  problem,  the  origin  in 
the  k  >  2  class  case.).  Then  for  each  point  on  the  ROC  curve,  take  the  2-norm  of  the 
vector  which  minimizes  the  distance  from  this  point  to  the  linear  variety.  If  we  knew  the 
function  generating  the  ROC  curve  (or  a  ROC  manifold),  we  could  calculate  the  optimal 
ROC  directly,  but  this  is  not  the  case  in  practice. 
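
For a case in which a closed form is available, the slope condition of Equation 4.44 can be checked numerically. The sketch below assumes a hypothetical binormal model (feature distributed as $N(1, 1)$ under $E_1$ and $N(0, 1)$ under $E_2$, with $\ell_1$ declared above a threshold $\theta$); the model, the grid search, and all function names are illustrative assumptions rather than part of the derivation.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def roc_point(theta):
    """(h, f(h)) = (p_{1|2}, p_{1|1}) for the threshold classifier A_theta."""
    h = 1.0 - Phi(theta)           # false positive rate, E2 ~ N(0, 1)
    fh = 1.0 - Phi(theta - 1.0)    # true positive rate, E1 ~ N(1, 1)
    return h, fh

def bayes_error(theta, prior1, prior2):
    """Equation 4.41 evaluated at the ROC point generated by theta."""
    h, fh = roc_point(theta)
    return h * prior2 + (1.0 - fh) * prior1

def roc_slope(theta, eps=1e-4):
    """Central-difference estimate of f'(h) at the ROC point generated by theta."""
    h_lo, f_lo = roc_point(theta + eps)   # larger threshold gives smaller h
    h_hi, f_hi = roc_point(theta - eps)
    return (f_hi - f_lo) / (h_hi - h_lo)

prior1, prior2 = 0.3, 0.7
grid = [-4.0 + 0.001 * i for i in range(10001)]
theta_star = min(grid, key=lambda th: bayes_error(th, prior1, prior2))
slope_at_bot = roc_slope(theta_star)
prior_ratio = prior2 / prior1
```

Up to the resolution of the grid, `slope_at_bot` agrees with `prior_ratio`, which is the relationship stated by the transversality condition.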

It  is  still  possible  that  many  ROC  curves  could  be  constructed  so  that  the  point  on 
the  ROC  curve  corresponding  to  the  BOT  for  each  one  has  the  same  distance  to  the  linear 
variety.  This  could  be  a  rather  large  equivalence  class  of  families  of  classification  systems. 
This  is  similar  to  the  problem  faced  when  using  area  under  the  curve  (AUC)  of  a  ROC 
curve  as  a  functional.  In  both  cases  the  underlying  posterior  conditional  probabilities 
are  unknown  and  there  are  just  too  many  possible  combinations  of  posterior  distributions 
that  can  produce  ROC  curves  with  the  same  AUC  (or  equal  BOT  functional  values).  The 
point, however, is that using a functional based on the BOT, we would have a level
playing field since we are debating which ROC (and therefore the classification system it
represents)  is  better  based  on  the  same  prior  probabilities.  AUC  equivalence  classes  are 
over  the  entire  range  of  possible  priors  and  therefore  of  less  value.  Furthermore,  the  AUC 
functional  does  not  relate  its  values  to  the  unknown  priors  at  all.  Rather,  it  is  related  to  the 
value  of  the  class  conditional  probabilities  associated  with  a  classification  system  over  all 
possible  false  positive  values.  It  is  therefore  essentially  useless  as  a  functional  in  trying 
to  discover  an  appropriate  operating  threshold  for  a  classification  system. 
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4.3  A  Category  of  Fusors 

4.3.1  A  Functional  for  Comparing  Families  of  Classification  Systems.  We  desire 
a  method  to  compete  families  of  classification  systems  with  the  specific  intent  to  compete 
fusion rules. We show explicitly how to do this with $n = 2$ classes. Although we are
proposing  one  specific  functional  on  the  ROC  curve  to  do  this,  other  functionals  can  be 
developed  as  well.  Ultimately,  once  the  functional,  along  with  its  associated  data  is 
chosen,  one  has  a  way  of  defining  fusion  (and  what  we  call  fusors)  for  the  given  problem. 

Let $n \in \mathbb{N}$ be the number of classes of interest, and $m = n^2 - n$. We construct the functional over the space $X = C([0,1]^{m-1}, \mathbb{R}) \cap C^1((0,1)^{m-1}, \mathbb{R})$, recognizing that we are competing ROC curves, which are by definition a subset of $X$. The functional

\[
F_2 : X \to \mathbb{R},
\]

where $n = 2$ is the number of classes, is denoted $F_2(\cdot\,; \gamma_1, \gamma_2, \alpha, \beta)$ for the ROC curves corresponding to a two-class family of classification systems, where $\gamma_1 = C_{2|1}\Pr(E_1)$ is the cost of the error of declaring class $E_2$ when the class is truthfully $E_1$ times the prior probability of class $E_1$, $\gamma_2 = C_{1|2}\Pr(E_2)$ is the cost of the error of declaring class $E_1$ when the class is truthfully $E_2$ times the prior probability of class $E_2$, while $\alpha = P_{1|2}$ and $\beta = P_{1|1}$ are the acceptable limits of false positive and true positive rates. Without loss of generality, we assume $\gamma_1$ to be the dependent constraint. The quadruple $(\gamma_1, \gamma_2, \alpha, \beta)$ comprises the data of the functional $F_2$.

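Definition 29 (ROC curve Functional). Let $(\gamma_1, \gamma_2, \alpha, \beta)$ be given data. Let

\[
y_0 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}
\quad\text{and}\quad
V_\Gamma = \left\{ v \;\middle|\; v = k \begin{pmatrix} \gamma_1 \\ \gamma_2 \end{pmatrix},\ \forall\, k \in \mathbb{R} \right\}.
\]

Then $V_\Gamma + y_0$ is a linear variety through the supremum ROC point, $(0, 1)$, over all possible ROC curves, under the data. Let $f \in X$ and let $f$ be non-decreasing. Let $\mathcal{R}(f)$ be the range of $f$, and let

\[
T = ([0, \alpha] \times [\beta, 1]) \cap \mathcal{R}(f).
\]

Let

\[
z_\Gamma = \min_{\substack{v \in V_\Gamma \\ y \in T}} \| v + y_0 - y \|_2 .
\]

Then define

\[
F_2(\cdot\,; \gamma_1, \gamma_2, \alpha, \beta) : X \to \mathbb{R}
\]

by

\[
F_2(f; \gamma_1, \gamma_2, \alpha, \beta) = \sqrt{2} - z_\Gamma, \quad \forall\, f \in X. \tag{4.45}
\]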

It should be clear that the constant $\sqrt{2}$ is the largest theoretical distance from all linear
varieties  to  a  curve  in  ROC  space. 
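
In practice a ROC curve is available only as a finite set of points, so the minimum in Definition 29 reduces to a search over those points. The following is a minimal sketch of that calculation, not a definitive implementation. The function name `roc_functional_2`, the sample points, and the convention of returning 0 when no point satisfies the limits (borrowed from Section 4.3.2) are assumptions made for illustration.

```python
import math


def roc_functional_2(points, gamma1, gamma2, alpha=1.0, beta=0.0):
    """Return (F2 value, minimizing ROC point) for a finite set of ROC points.

    points: iterable of (false positive rate, true positive rate) pairs.
    gamma1, gamma2: cost-prior products; alpha, beta: acceptable limits.
    """
    norm = math.hypot(gamma1, gamma2)
    if norm == 0.0:
        raise ValueError("gamma1 and gamma2 cannot both be zero")

    # T: the part of the curve inside the window [0, alpha] x [beta, 1].
    feasible = [(x, y) for (x, y) in points
                if 0.0 <= x <= alpha and beta <= y <= 1.0]
    if not feasible:
        # Assumed convention (borrowed from Section 4.3.2): empty set scores 0.
        return 0.0, None

    def distance(point):
        # Distance from the point to the line through y0 = (0, 1)
        # with direction (gamma1, gamma2).
        x, y = point
        return abs(gamma2 * x - gamma1 * (y - 1.0)) / norm

    best = min(feasible, key=distance)
    return math.sqrt(2.0) - distance(best), best


# Hypothetical ROC points, for illustration only.
roc_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.9), (1.0, 1.0)]
value, best_point = roc_functional_2(roc_points, gamma1=1.0, gamma2=1.0)
```

For a point $(x, y)$ the distance computed here equals $(\gamma_2 x + \gamma_1 (1 - y)) / \|(\gamma_1, \gamma_2)\|$, which is the Bayes risk of that operating point scaled by a constant, so the minimizing point is the one with the smallest risk among the points supplied.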

So far, it is shown that $F_n$ is minimal at the Bayes optimal point of the ROC curve under no constraints restricting the values possible for it to take in ROC space (i.e., $\alpha = 1$ and $\beta = 0$ in the 2-class case, and $\alpha = (1, \ldots, 1)$ in the $n$-class case). We can now relate this functional to the Neyman-Pearson (N-P) criterion. Recall that the N-P criterion is also known as the most powerful test of size $\alpha_0$, where $\alpha_0$ is the a priori assigned maximum false positive rate [45]. Given a family of classification systems $\mathbb{A} = \{A_\theta : \theta \in \Theta\}$, the N-P criterion could be written as

\[
\max_{\theta \in \Theta} P_{1|1}(A_\theta) \quad \text{subject to} \quad P_{1|2}(A_\theta) \le \alpha_0 .
\]


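Theorem 5 (ROC Functional-Neyman-Pearson Equivalence). Let $\gamma_1$ be the dependent constraint, and $\sum_{i=1}^{2} \gamma_i \le 1$. The ROC functional $F_2(\cdot\,; \gamma_1, \gamma_2, \alpha, \beta)$ under data $(1, 0, \alpha_0, 0)$ yields the same point on a ROC curve as the Neyman-Pearson criterion with $\alpha \le \alpha_0$.

Proof: Suppose $(\gamma_1, \gamma_2, \alpha, \beta) = (1, 0, \alpha_0, 0)$. Then $\Gamma = (1, 0)$ and

\[
V_\Gamma = \left\{ v \;\middle|\; v = \begin{pmatrix} k \\ 0 \end{pmatrix},\ \forall\, k \in \mathbb{R} \right\},
\]

and let

\[
y_0 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.
\]

Thus, $V_\Gamma + y_0$ is the appropriate linear variety. Let

\[
T = ([0, \alpha_0] \times [0, 1]) \cap \mathcal{R}(f),
\]

where $f$ is a ROC curve, and consider $\beta_N \in f([0, \alpha_0])$ as the optimal point in the image of $f$ under the N-P criterion. Then $z_N = 1 - \beta_N$ is the distance to $V_\Gamma + y_0$. Now,

\[
F_2(f) = \sqrt{2} - z_\Gamma,
\]

where

\[
z_\Gamma = \min_{\substack{v \in V_\Gamma \\ y \in T}} \| v + y_0 - y \|_2 .
\]

Thus, we have that $\beta_N \ge \beta$, $\forall\, \beta = f(\alpha)$, $\forall\, \alpha \le \alpha_0$. Hence, $1 - \beta_N \le 1 - \beta$, $\forall\, \beta = f(\alpha)$, $\forall\, \alpha \le \alpha_0$. Then for

\[
y_N = \begin{pmatrix} \alpha_N \\ \beta_N \end{pmatrix},
\]

we have that

\[
(1 - \beta_N)^2
= \left\| \begin{pmatrix} \alpha_N \\ 1 \end{pmatrix} - \begin{pmatrix} \alpha_N \\ \beta_N \end{pmatrix} \right\|^2
= \left\| \begin{pmatrix} \alpha_N \\ 1 \end{pmatrix} - y_N \right\|^2
\le \left\| \begin{pmatrix} \alpha \\ 1 \end{pmatrix} - \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \right\|^2 ,
\]

$\forall\, \beta = f(\alpha)$, $\forall\, \alpha \le \alpha_0$. Thus, letting $y = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$, we have that

\begin{align}
\left\| \begin{pmatrix} \alpha_N \\ 1 \end{pmatrix} - y_N \right\|
&\le \min_{\alpha \le \alpha_0} \left\| \begin{pmatrix} \alpha \\ 1 \end{pmatrix} - y \right\|
 = \min_{\substack{\alpha \le \alpha_0 \\ y \in [0,\alpha_0] \times f([0,\alpha_0])}} \left\| \begin{pmatrix} \alpha \\ 1 \end{pmatrix} - y \right\| \tag{4.46} \\
&= \min_{\substack{\alpha \le \alpha_0 \\ y \in [0,\alpha_0] \times f([0,\alpha_0])}} \left\| \begin{pmatrix} \alpha \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 1 \end{pmatrix} - y \right\| \tag{4.47} \\
&= \min_{\substack{v \in V_\Gamma \\ y \in [0,\alpha_0] \times f([0,\alpha_0])}} \| v + y_0 - y \| . \tag{4.48}
\end{align}

On the other hand,

\begin{align}
\min_{\substack{v \in V_\Gamma \\ y \in [0,\alpha_0] \times f([0,\alpha_0])}} \| v + y_0 - y \|
&\le \min_{v \in V_\Gamma} \| v + y_0 - y_N \| \tag{4.49} \\
&\le \left\| \begin{pmatrix} \alpha_N \\ 1 \end{pmatrix} - y_N \right\| \tag{4.50} \\
&= \left\| \begin{pmatrix} 0 \\ 1 - \beta_N \end{pmatrix} \right\| \tag{4.51} \\
&= 1 - \beta_N . \tag{4.52}
\end{align}

Therefore, we have that

\[
z_\Gamma = \min_{\substack{v \in V_\Gamma \\ y \in [0,\alpha_0] \times f([0,\alpha_0])}} \| v + y_0 - y \| = 1 - \beta_N .
\]

But $z_\Gamma = 1 - \beta_\Gamma$, where $\beta_\Gamma$ is the optimal point in the image of $f$ under the ROC functional, so that $\beta_\Gamma = \beta_N$. So, we have that the ROC functional, under data $(1, 0, \alpha_0, 0)$, acting on a ROC curve corresponds to the power of the most powerful test of size $\alpha_0$. $\diamond$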

This idea can be extended to the $k > 2$-class problem by setting a maximum acceptable
error rate $\alpha_m$ for each of the $m - 1$ independent error axes, where $m = k^2 - k$.
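
The equivalence in Theorem 5 can be checked numerically on a finite set of ROC points. The sketch below is illustrative only: the function names and the sample points are assumptions, and the functional is evaluated by direct search rather than by any closed form.

```python
import math


def neyman_pearson_point(points, alpha0):
    """Most powerful point: largest true positive rate with fpr <= alpha0."""
    feasible = [p for p in points if p[0] <= alpha0]
    return max(feasible, key=lambda p: p[1]) if feasible else None


def functional_point(points, gamma1, gamma2, alpha, beta):
    """Point of the window [0, alpha] x [beta, 1] nearest the linear variety."""
    norm = math.hypot(gamma1, gamma2)
    feasible = [p for p in points if p[0] <= alpha and p[1] >= beta]
    if not feasible:
        return None
    # Distance to the line through (0, 1) with direction (gamma1, gamma2).
    return min(feasible,
               key=lambda p: abs(gamma2 * p[0] - gamma1 * (p[1] - 1.0)) / norm)


# Hypothetical ROC points, for illustration only.
roc_points = [(0.0, 0.0), (0.05, 0.4), (0.2, 0.8), (0.5, 0.95), (1.0, 1.0)]
alpha0 = 0.25
np_point = neyman_pearson_point(roc_points, alpha0)
# Data (1, 0, alpha0, 0): the variety is the horizontal line tpr = 1.
f_point = functional_point(roc_points, 1.0, 0.0, alpha0, 0.0)
```

With data $(1, 0, \alpha_0, 0)$ the distance to the variety is simply $1 - \beta$, so both searches select the admissible point with the largest true positive rate.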

4.3.2 The Calculation and Scalability of the ROC Functional. The calculation and scalability of the functional are straightforward. Suppose we have $k$ classes. In the two-class case, one axis is chosen as $P_{1|1}$, but in the $k$-class case, each axis is an error axis. This is absolutely necessary in the case where costs of errors differ within a class. If we apply this methodology to the two-class case, the two axes would be $P_{1|2}$ and $P_{2|1}$ with the ROC curve starting at point $(0, 1)$ and terminating at point $(1, 0)$. A ROC at the origin would represent the perfect classification system (the supremum ROC) under this scheme. We choose the conditional class probability $P_{k|k-1}$ to be the dependent one. Let $m = k^2 - k$. Let $d = (\gamma_1, \ldots, \gamma_m, \alpha_1, \ldots, \alpha_m)$ be the data, and let each $r = 1, 2, \ldots, m$ be associated with one of the $m$ pairs $(i, j)$, where for each $i = 1, 2, \ldots, k$ we have $j = 1, 2, \ldots, k$ with $i \neq j$. Let $\alpha_m$ be associated with $P_{k|k-1}$. Let $\mathbf{q} = (q_1, \ldots, q_m)$. Then let

\[
Q = \{ \mathbf{q} \mid q_r = P_{i|j},\ r = 1, 2, \ldots, m;\ P_{i|j} \le \alpha_r,\ r \neq m;\ i, j = 1, 2, \ldots, k;\ i \neq j \} \tag{4.53}
\]

be the set of points comprising the ROC curve within the constraints. Then we have that $y_0 = (0, 0, \ldots, 0)$ and $\mathbf{N} = (\gamma_1, \ldots, \gamma_m)$, so that if we are given the ROC curve represented by the set $Q$, call it $f_Q$, we have that

\[
F_n(f_Q; d) = \sqrt{2} - \min_{\mathbf{q} \in Q} \{ \langle \mathbf{q}, \mathbf{n} \rangle \},
\]

with $\mathbf{n}$ the unit normal in the direction of $\mathbf{N}$, when $Q$ is not empty, and

\[
F_n(f_Q; d) = 0
\]

otherwise. The notation $\langle \cdot, \cdot \rangle$ is the scalar product. Figure 4.1 shows the geometry of the ROC functional calculation where the number of classes is $n = 2$, and the given data is $(\gamma, \alpha)$.

Figure 4.1: Geometry of calculating the ROC functional, $F_2$, for a point (with vector $\mathbf{q}$) on ROC curve $f_C$.
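
The calculation described in this subsection amounts to a filter followed by a minimum of scalar products. A minimal sketch follows, assuming each ROC point is supplied as an $m$-tuple of error rates whose coordinate order matches the order of the cost-prior products, with the last coordinate taken as the dependent axis. The function name `roc_functional_n` and the sample values are hypothetical.

```python
import math


def roc_functional_n(error_points, gammas, alphas):
    """Sketch of F_n on a finite set of error-rate vectors.

    error_points: list of m-tuples of conditional error rates.
    gammas: m cost-prior products, same coordinate order as the points.
    alphas: m limits; the last one (dependent axis) is not enforced.
    """
    m = len(gammas)
    norm = math.sqrt(sum(g * g for g in gammas))
    if norm == 0.0:
        raise ValueError("at least one cost-prior product must be nonzero")
    unit_normal = [g / norm for g in gammas]

    # Q: points meeting the limits on the m - 1 independent error axes.
    q_set = [q for q in error_points
             if all(q[r] <= alphas[r] for r in range(m - 1))]
    if not q_set:
        return 0.0

    smallest = min(sum(qr * nr for qr, nr in zip(q, unit_normal))
                   for q in q_set)
    return math.sqrt(2.0) - smallest


# Two-class illustration on error axes; values are made up.
points = [(0.0, 1.0), (0.1, 0.4), (0.3, 0.1), (1.0, 0.0)]
score = roc_functional_n(points, gammas=(1.0, 1.0), alphas=(1.0, 1.0))
```

Because the linear variety passes through the origin on error axes, the scalar product with the unit normal is already the distance, so no explicit projection is needed.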

4.4  The  Min-Max  Threshold 

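Suppose we are given a ROC $m$-manifold with $m = k^2 - k$. This can be viewed as a set of class conditional probability $m$-vectors $A = \{\alpha(t) : t \in T\}$ that form a continuously differentiable, non-increasing function in ROC $m$-space. Let

\[
A^c = \{ I - \alpha(t) : \alpha(t) \in A \}, \tag{4.54}
\]

where $I$ is the appropriate identity vector. We associate with each of the $m$ errors a cost $c_i$, $i = 1, 2, \ldots, m$, so that $\alpha_i$ corresponds to $c_i$. There are also $k$ prior probabilities, $p_k$, each prior having $k - 1$ copies, so that we can enumerate them and allow each $\alpha_i(t)$ to correspond to $p_i$, $i = 1, 2, \ldots, m$, where $p_i = p_k$ for each $i$ and some particular $k$. Let $\mathcal{B} = \{e_1, e_2, \ldots, e_m\}$ be the standard basis for the linear space $\mathbb{R}^m$. Then put $\mathcal{T} = \sum_{i=1}^{m} c_i e_i$. Then for any $\nu \in \mathbb{R}^m$ we have that

\[
\mathcal{T}\nu = (c_1 \nu_1, c_2 \nu_2, \ldots, c_m \nu_m)^T . \tag{4.55}
\]

Now a risk is a decision error times the cost of such an error, so that in our vernacular a risk is $(\mathcal{T}\alpha)_i = c_i \alpha_i$. Hence, $\mathcal{T}\alpha$ is a risk vector and $\mathcal{T}(A)$ is a risk set. Let $a_i \in \mathbb{R}$ such that $\sum_{i=1}^{m} a_i = 1$. Let

\[
\mathcal{R} = \left\{ r \in \mathbb{R}^m : r = \sum_{i=1}^{m} (\mathcal{T}\alpha_i)\, a_i,\ \forall\, \alpha_i \in A \cup A^c \right\}.
\]

Then $\mathcal{R}$ is a convex risk set. Let $\mathcal{P} = \{p_i : i = 1, 2, \ldots, m\}$. $\mathcal{P}$ is a convex set. Now consider also that

\[
\langle r, p \rangle = \langle \mathcal{T}\alpha, p \rangle = \langle \alpha, \mathcal{T}^* p \rangle = \langle \alpha, \mathcal{T} p \rangle \tag{4.56}
\]

showing that $\mathcal{T}$ is a self-adjoint linear operator on $\mathbb{R}^m$. Since $\mathbb{R}^m$ is a reflexive, normed space, and $\mathcal{R}$ and $\mathcal{P}$ are convex subsets of $\mathbb{R}^m$ and $\mathbb{R}^{m*}$ respectively, we have by the Min-Max theorem [31]

\[
\min_{r \in \mathcal{R}} \left[ \max_{p \in \mathcal{P}} \langle r, p \rangle \right] = \max_{p \in \mathcal{P}} \left[ \min_{r \in \mathcal{R}} \langle r, p \rangle \right] \tag{4.57}
\]

and this occurs where $r$ and $p$ are aligned, so that

\[
\langle r^*, p^* \rangle = \|r^*\| \, \|p^*\| \tag{4.58}
\]

for the unique $r^*$ and $p^*$ which make Equation 4.57 hold true. Now define $\mathbb{A} = \operatorname{conv}(A \cup A^c)$, so then $\mathbb{A}$ is the convex hull of the ROC manifold and its "complement". Thus, $\mathcal{T}\mathbb{A} = \mathcal{R}$. Furthermore, we have that $\mathcal{T}(\mathcal{P})$ is a convex subset of $\mathbb{R}^{m*}$, and $\mathbb{A} \subset \mathbb{R}^m$. Thus, the Min-Max theorem applies so that

\[
\min_{\alpha \in \mathbb{A}} \left[ \max_{\mathcal{T}p \in \mathcal{T}(\mathcal{P})} \langle \alpha, \mathcal{T}p \rangle \right] = \max_{\mathcal{T}p \in \mathcal{T}(\mathcal{P})} \left[ \min_{\alpha \in \mathbb{A}} \langle \alpha, \mathcal{T}p \rangle \right] \tag{4.59}
\]

which only occurs where $\alpha$ and $\mathcal{T}p$ are aligned. Therefore, we have that

\[
\langle \alpha^{**}, \mathcal{T}p^{**} \rangle = \|\alpha^{**}\| \, \|\mathcal{T}p^{**}\| . \tag{4.60}
\]

Hence,

\begin{align}
\min_{r \in \mathcal{R}} \left[ \max_{p \in \mathcal{P}} \langle r, p \rangle \right]
&= \min_{\mathcal{T}\alpha \in \mathcal{T}(\mathbb{A})} \left[ \max_{p \in \mathcal{P}} \langle \mathcal{T}\alpha, p \rangle \right] \tag{4.61} \\
&= \min_{\alpha \in \mathbb{A}} \left[ \max_{p \in \mathcal{P}} \langle \mathcal{T}\alpha, p \rangle \right] \tag{4.62} \\
&= \min_{\alpha \in \mathbb{A}} \left[ \max_{p \in \mathcal{P}} \langle \alpha, \mathcal{T}p \rangle \right] \tag{4.63} \\
&= \min_{\alpha \in \mathbb{A}} \left[ \max_{\mathcal{T}p \in \mathcal{T}(\mathcal{P})} \langle \alpha, \mathcal{T}p \rangle \right] \tag{4.64} \\
&= \max_{\mathcal{T}p \in \mathcal{T}(\mathcal{P})} \left[ \min_{\alpha \in \mathbb{A}} \langle \alpha, \mathcal{T}p \rangle \right] \tag{4.65} \\
&= \|\alpha^{**}\| \, \|\mathcal{T}p^{**}\| \tag{4.66} \\
&= \|\alpha^{*}\| \, \|\mathcal{T}p^{*}\| . \tag{4.67}
\end{align}

So,

\[
\|r^*\| = \kappa \, \|\alpha^*\| , \tag{4.68}
\]

where

\[
\kappa = \frac{\|\mathcal{T}p^*\|}{\|p^*\|} . \tag{4.69}
\]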
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The  point  of  this  section  is  that  the  minimax  point  on  the  hull  of  the  ROC  manifold  is  now 
shown to be the point with minimum $\ell_2$-norm. This point corresponds to the minimax
point  of  the  convex  risk  set  generated  by  the  self-adjoint  linear  transformation  T  on  the 
ROC  manifold.  This  leads  to  the  conclusion  that  when  a  researcher  is  testing  two  or  more 
families  of  classification  systems,  if  he  has  good  knowledge  of  the  prior  probabilities,  then 
the  ROC  functional,  Fk,  is  the  preferred  functional  with  which  to  establish  which  fusion 
rules  are  fusors.  On  the  other  hand,  if  prior  probabilities  are  not  understood  well,  the 
minimax  threshold  may  be  the  threshold  he  would  want  to  compare  in  order  to  establish 
the  partial  ordering  over  the  fusion  rules  (and  for  defining  the  fusors).  In  this  case,  the 
researcher  would  want  to  compare  values  of  the  functional 

\[
G_k(A_j) = \min_{\alpha \in A_j} \|\alpha\|_2 , \tag{4.70}
\]

for each family of classification systems $A_j$. There is one caveat to the solution here.
This  is  based  on  research  in  [42],  where  it  is  shown  that  if  the  solution  to  Equation  4.70  is 
not  on  the  ROC  convex  hull,  then  a  random  decision  rule  can  be  developed  using  the  two 
closest points which are on the convex hull, with this random decision rule being superior
to the optimizing argument of the functional $G_k(A_j)$. In other words, its 2-norm would
be smaller than what the family $A_j$ can produce.
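
When a family is known only through a finite set of error-rate vectors, Equation 4.70 becomes a search for the vector of smallest Euclidean length. The sketch below illustrates this under that assumption. The function name and the sample families are hypothetical, and the sketch does not construct the randomized decision rule mentioned in the caveat above.

```python
import math


def minimax_functional(error_points):
    """Return (smallest 2-norm, the error-rate vector that attains it)."""
    best = min(error_points, key=lambda q: math.sqrt(sum(c * c for c in q)))
    return math.sqrt(sum(c * c for c in best)), best


# Two made-up families of two-class systems on error axes.
family_a = [(0.1, 0.4), (0.3, 0.1), (0.6, 0.05)]
family_b = [(0.2, 0.3), (0.25, 0.25), (0.5, 0.1)]
g_a, point_a = minimax_functional(family_a)
g_b, point_b = minimax_functional(family_b)
```

Since the vectors hold error rates, the family with the smaller value of $G_k$ is the one whose best system lies closer to the perfect (zero-error) point.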

4.4.1  Defining  Fusors.  We  are  now  in  a  position  to  define  a  way  in  which  we 
can  compete  fusion  rules.  Suppose  we  have  a  fixed  classification  system  such  as  that  in 
Figure  3.2.  Each  branch  of  the  system  (whether  fixed,  or  associated  with  a  fusion  rule) 
has  a  ROC  manifold  that  can  be  associated  with  the  family  of  classification  systems,  and 
we  now  have  a  viable  means  of  competing  each  branch.  If  we  can  only  choose  among 
the  two  classification  systems,  take  the  one  whose  associated  ROC  functional  is  greater. 
Therefore,  we  can  also  compete  these  two  classification  systems  with  a  new  system  that 
fuses  the  two  data  categories  (or  the  feature  or  label  categories  for  that  matter)  by  fixing 
a  third  family  of  classification  systems,  which  is  based  on  the  fusion  rule,  and  finding  the 
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ROC  functional  of  the  event-to-label  system  corresponding  to  the  fused  data  (features).  If 
the  fused  branch’s  ROC  functional  is  greater  than  either  of  the  original  two,  then  the  fusion 
rule  is  a  fusor.  Repeating  this  process  on  a  finite  number  of  fusion  rules,  we  discover  a 
finite  collection  of  fusors  with  associated  ROC  functional  values.  Since  the  subcategory 
of  fusors  is  partially  ordered,  the  best  choice  for  a  fusor  is  the  fusor  corresponding  to  the 
largest  ROC  functional  value.  Do  you  want  to  change  your  a  priori  probabilities?  Simply 
adjust $\gamma$ in the ROC functional's data and recalculate the BOTs for each system. Then
calculate  the  ROC  functional  for  each  corresponding  ROC  and  choose  the  largest  value. 
The  corresponding  fusor  is  then  the  best  fusor  to  select  under  your  criteria.  Therefore, 
given  a  finite  collection  of  fusion  rules,  we  have  for  fixed  ROC  functional  data  a  partial 
ordering  of  fusors. 
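
The selection procedure just described can be sketched in a few lines, assuming the ROC functional value of every fixed branch and every fused branch has already been computed under the same data. The function name, the dictionary layout, and the scores below are illustrative assumptions; the comparison uses "greater than or equal", following Definition 30.

```python
def select_fusors(branch_scores, fused_scores):
    """Return the fusion rules that qualify as fusors, best first.

    branch_scores: dict mapping fixed-branch name -> ROC functional value.
    fused_scores: dict mapping fusion-rule name -> ROC functional value
                  of the fused branch it generates.
    """
    best_branch = max(branch_scores.values())
    # A rule is a fusor when its fused branch does at least as well as
    # every fixed branch.
    fusors = {rule: s for rule, s in fused_scores.items() if s >= best_branch}
    return sorted(fusors, key=fusors.get, reverse=True)


# Made-up functional values for two fixed branches and three fusion rules.
branches = {"A": 1.10, "B": 1.05}
fused = {"rule_1": 1.20, "rule_2": 1.08, "rule_3": 1.12}
ranking = select_fusors(branches, fused)
```

Changing the data (for example, a different $\gamma$) changes the scores, so the ranking has to be recomputed from fresh functional values rather than reused.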


Figure  4.2:  ROC  Curves  of  Two  Competing  Classification  Systems. 

Definition 30 (Fusor over ROC Manifolds). Let $I \subset \mathbb{N}$ be a finite subset of the natural numbers, with $\max I = n$. Given $\{A_i\}_{i \in I}$ a finite collection of similar families of classification systems, let $\mathcal{O}^{(n)}$ be the category of fusion rules associated with the product of $n$ data sets. Let $F_m$ be the ROC functional on the associated ROC manifolds of the families of classification systems, both original and fused, where $m = k^2 - k$, with $k$ being the number of classes of interest in the classification problem. Let $(\gamma, \alpha)$ be the established data for the problem. Then given that $f_{A_i}$ is the ROC curve of the $i$th family of classification systems, and $f_{A_{\mathfrak{R}}}$ the ROC curve of the classification family $A_{\mathfrak{R}}$, associated with fusion rule $\mathfrak{R} \in \mathrm{Ob}(\mathcal{O}^{(n)})$, we say that

\[
A_i \succeq A_j \iff F_m(f_{A_i}) \ge F_m(f_{A_j}) \tag{4.71}
\]

so that if $A_{\mathfrak{R}} \succeq A_i$ for all $i \in I$, then $\mathfrak{R}$ is called a fusor.

There is then a category of fusors, which is a subcategory of $\mathcal{O}^{(n)}$, and whose arrows are induced by the ROC functional, $F$, such that given objects $\mathfrak{R}$ and $\mathfrak{S}$ of this subcategory, there exists an arrow $\mathfrak{R} \xrightarrow{\ \ge\ } \mathfrak{S}$ if and only if $A_{\mathfrak{R}} \succeq A_{\mathfrak{S}}$ if and only if $\rho_{\mathfrak{R}} \ge \rho_{\mathfrak{S}}$. This can be seen in the commutativity of the rectangle constructed from Equation 3.9,

\[
\begin{array}{ccccccc}
\mathfrak{R} & \longrightarrow & A_{\mathfrak{R}} & \longrightarrow & f_{A_{\mathfrak{R}}} & \longrightarrow & F(f_{A_{\mathfrak{R}}}) = \rho_{\mathfrak{R}} \\
\big\downarrow{\scriptstyle \ge} & & \big\downarrow{\scriptstyle \succeq} & & \big\downarrow{\scriptstyle \succeq} & & \big\downarrow{\scriptstyle \ge} \\
\mathfrak{S} & \longrightarrow & A_{\mathfrak{S}} & \longrightarrow & f_{A_{\mathfrak{S}}} & \longrightarrow & F(f_{A_{\mathfrak{S}}}) = \rho_{\mathfrak{S}}
\end{array}
\]

where we can see that in order for the rectangle to commute, $\ge$ must be a partial order. We are now in a position to define the fusion processes.

Definition 31 (Fusion-Rule Process). Given a fixed classification problem defined by the
category $\mathcal{L}^E$, a fusion-rule process is an element of $\mathrm{Ob}(\mathcal{L}^E)$.

We  didn’t  really  whittle  this  down  from  the  category  of  classification  systems,  because 
a  fusion  rule  could  be  the  rule  “choose  classification  system  X”,  which  doesn’t  necessarily 
give  a  performance  improvement.  The  next  definition  is  the  one  of  interest,  since  it  defines 
the  fusion  with  the  necessary  addition  of  a  qualitative  element. 

Definition 32 (Fusion Process). Given a fixed classification problem defined by the category $\mathcal{L}^E$, and a natural transformation from this category to a category defined by a poset $P = (X, \ge)$, let $\mathrm{FUS}_{\mathcal{L}^E}$ be the subcategory of classification systems induced by the partial ordering. This category has as objects precisely those objects of $\mathcal{L}^E$ which have an arrow pointing to every fixed branch. We then say a fusion process is an element of $\mathrm{Ob}(\mathrm{FUS}_{\mathcal{L}^E})$, and we can call this category the category of fusion processes.

We  have  now  given  a  definition  of  the  fusion  process  which  contains  everything  nec¬ 
essary.  As  an  example,  suppose  we  start  with  the  system 


with $L$ a $k$-class label set. Let $A_\theta = a_\theta \circ p_1 \circ s_1$, and $B_\phi = b_\phi \circ p_2 \circ s_2$, and consider a functional $F_k$ on the ROC curves $f_A$ and $f_B$ where $A$ and $B$ are defined as families of the respective classification systems shown ($F_k$ being created under the assumptions and data of the researcher's choice). Then, given fusion rules $\mathfrak{S}$, such as that in Figure 4.3, and $\mathfrak{T}$ and a second fusion system

[diagram of the second fusion system]

let $f_{\mathfrak{S}}$ and $f_{\mathfrak{T}}$ refer to the corresponding ROC curves to each of the fusion rule's systems (as a possible example of ROC curves of competing fusion rules see Figure 4.2). Then we have that if $F_k(f_{\mathfrak{S}}) \ge F_k(f_A)$ and $F_k(f_{\mathfrak{S}}) \ge F_k(f_B)$ and similarly, if $F_k(f_{\mathfrak{T}}) \ge F_k(f_A)$ and $F_k(f_{\mathfrak{T}}) \ge F_k(f_B)$ then we say that $\mathfrak{S}$, $\mathfrak{T}$ are fusors. Furthermore, suppose $F_k(f_{\mathfrak{S}}) \ge F_k(f_{\mathfrak{T}})$. Then we have that $\mathfrak{S} \succeq \mathfrak{T}$. Thus, $\mathfrak{S}$ is the fusor a researcher would select under the given assumptions and data. Figure 4.3 is a diagram showing all branches and products (along with the associated projectors) in category theory notation.




Figure  4.3:  Data  Fusion  of  Two  Classification  Systems. 

4.4.2  Fusing  Across  Different  Label  Sets.  Up  to  this  point,  we  have  considered 
fusing  only  those  branches  of  our  fixed  classification  category.  This  category  had  a  fixed 
event  set  and  a  fixed  label  set.  Sometimes  researchers  have  reason  to  fuse  classification 
systems  which  classify  events  into  different  label  sets  before  fusion  takes  place.  For 
example, consider the classification of a mammogram by two classification systems, $A_1$
and $A_2$. The first system detects microcalcifications in the breast and returns a result of
cancer  or  non-cancer.  The  second  system  detects  irregular  masses  and  returns  a  result  of 
cancer  or  non-cancer.  While  the  label  sets  look  the  same  (in  fact,  bijective),  they  are  not 
equal.  The  first  partitions  the  event  set  into  two  sets,  one  where  microcalcifications  are 
present  and  one  where  they  are  not.  Obviously,  irregular  masses  can  occur  in  either  set,  so 
that the cancer label of system $A_1$ does not correspond with the cancer set of system $A_2$. We would still like to fuse the results, but now we must consider carefully what the label set should be. It would be prudent to put the label set again as cancer and non-cancer, which is isomorphic to both the original label sets. The new label set could still be cancer or no cancer; however, these labels induce a new partition of the event space since we now consider cancerous results to be those where microcalcifications or irregular masses are returned by the systems. This leads to two definitions developed by Drs. Oxley, Bauer, Schubert, and myself [46].

Definition 33 (Consistent Functor Category of Classification Systems). A functor category of classification systems, $\mathcal{L}^E$, is called consistent when there exists:

1. a probability space $(E, \mathcal{E}, \mu)$,

2. a finite label set $\mathcal{L} = \{\ell_1, \ell_2, \ldots, \ell_M\}$,

3. a classification system $\tau \in \mathcal{L}^E$,

such that the set of sets

\[
\left\{ \tau^{\natural}(\{\ell_i\}) \right\}_{i=1}^{M}
\]

forms a partition of $E$. That is, for $\tau^{\natural}(\{\ell_i\}) = E_i$, we have that $\bigcup_{i=1}^{M} E_i = E$ and $E_i \cap E_j = \emptyset$ for all $i \neq j$. In practice, the classification system $\tau$ referred to above, in a consistent, fixed classification system is called the "truth" classifier.

It should be clear from the definition above that

\[
P(\tau^{\natural}(\{\ell_i\}) \mid E_i) = 1. \tag{4.72}
\]
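
For a finite event set, the partition requirement in Definition 33 can be checked directly. The sketch below is a simplified illustration that ignores the probability measure: the function name, the representation of the truth classifier as a dictionary, and the sample events are all assumptions.

```python
def is_consistent(events, labels, truth):
    """Check that the label pre-images under `truth` partition `events`.

    events: iterable of hashable events; labels: iterable of labels;
    truth: dict mapping each event to the label assigned by the truth
           classifier (events missing from the dict are unlabeled).
    """
    events = set(events)
    preimages = {label: {e for e in events if truth.get(e) == label}
                 for label in labels}
    # The pre-images of distinct labels are disjoint automatically because
    # `truth` assigns one label per event, so coverage is the only check.
    covered = set()
    for block in preimages.values():
        covered |= block
    return covered == events


truth_map = {1: "cancer", 2: "non-cancer", 3: "cancer", 4: "non-cancer"}
ok = is_consistent([1, 2, 3, 4], ["cancer", "non-cancer"], truth_map)
```

An event that the truth classifier sends to a label outside the label set, or does not label at all, breaks the coverage condition and the check fails.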


Definition 34 (Within-Fusion Rule). Let $\mathcal{S}$ be a fixed classification system with $N$ fixed branches. Assume the following:

• $(E, \mathcal{E}, \mu)$ is a probability space;

• $\mathcal{L} = \{\ell_1, \ell_2, \ldots, \ell_M\}$ is a finite label set;

• $\mathcal{L}^E$ is consistent;

• $\mathcal{E}_{\mathcal{L}} = \{E_{\ell_1}, E_{\ell_2}, \ldots, E_{\ell_M}\} \subset \mathcal{E}$ is the partition of $E$ with respect to $\mathcal{L}$ and truth classifier $\tau$;

Let $A_{\mathfrak{R}}$ represent the branch generated by fusion rule $\mathfrak{R}$. If for each $m = 1, 2, \ldots, M$, the fixed branches $A_1, A_2, \ldots, A_N : E \to \mathcal{L}$ are designed to map $E_{\ell_m}$ to $\ell_m$, then the fusion rule $\mathfrak{R}$ is said to be a within-fusion rule. Furthermore, $A_{\mathfrak{R}} : E \to \mathcal{L}$ is designed to map $E_{\ell_m}$ to $\ell_m$ for each $m = 1, 2, \ldots, M$.

Definition 35 (Across-Fusion Rule). Let $\mathcal{S}$ be a fixed classification system with $N$ fixed branches. Assume the following:

• $(E, \mathcal{E}, \mu)$ is a probability space;

• $\mathcal{L} = \{\ell_1, \ell_2, \ldots, \ell_M\}$ is a finite label set, and $\mathfrak{L}$ is the power set of $\mathcal{L}$ so that $(\mathcal{L}, \mathfrak{L})$ is a measurable space;

• $\mathcal{L}^E$ is consistent;

• $\mathcal{E}_{\mathcal{L}} = \{E_{\ell_1}, E_{\ell_2}, \ldots, E_{\ell_M}\} \subset \mathcal{E}$ is a partition of $E$ with respect to $\mathcal{L}$ and truth classifier $\tau$;

• $\mathcal{L}^{(0)}, \mathcal{L}^{(1)}, \ldots, \mathcal{L}^{(N)} \subset \mathfrak{L}$ are (possibly different) partitions of $\mathcal{L}$, which allow for their functor categories to be consistent, each under a different truth classifier, say $\tau_n$ for $n = 0, 1, \ldots, N$;

• for each $n = 0, 1, \ldots, N$, let $M^{(n)} = \operatorname{card}(\mathcal{L}^{(n)}) \le M$, and $\mathcal{L}^{(n)}$ correspond to the label set $L^{(n)} = \{\omega_1^{(n)}, \omega_2^{(n)}, \ldots, \omega_{M^{(n)}}^{(n)}\}$ in a one-to-one fashion;

• for each $n = 0, 1, \ldots, N$, $\mathcal{E}^{(n)} \subset \mathcal{E}$ is the partition of $E$ with respect to $\mathcal{L}^{(n)}$ (and $L^{(n)}$) and truth classifier $\tau_n$;

If the families of classification systems,

\[
A_1 : E \to L^{(1)},\ A_2 : E \to L^{(2)},\ \ldots,\ A_N : E \to L^{(N)},
\]

are designed to map each partition set of $\mathcal{E}^{(n)}$ to the corresponding $\omega_j^{(n)} \in L^{(n)}$ for every $n = 1, 2, \ldots, N$, and $j \le M^{(n)}$, then the fusion rule $\mathfrak{R}$ that combines the collection of such systems (yielding a new family of classification systems),

\[
A_0 = \mathfrak{R}(A_1, A_2, \ldots, A_N),
\]

is said to be an across-fusion rule. Furthermore, $A_0 : E \to L^{(0)}$ is designed to map partition sets in $\mathcal{E}^{(0)}$ to the corresponding $\omega_j^{(0)} \in L^{(0)}$, for $j \le M^{(0)}$.

The diagram of across-fusion, where $A_{\mathfrak{R}}$ represents the branch which is essentially a fused branch, is shown in Figure 4.4. If the partitions are equal among the families of classification systems and if the partitions are each injective to $\mathcal{L}$, that is,

\[
\mathcal{L}^{(1)} = \mathcal{L}^{(2)} = \cdots = \mathcal{L}^{(N)} = \{\{\ell_1\}, \{\ell_2\}, \ldots, \{\ell_M\}\}
\]

so that

\[
L^{(1)} = L^{(2)} = \cdots = L^{(N)} = \{\ell_1, \ell_2, \ldots, \ell_M\} = \mathcal{L},
\]

then there is no need to consider other partitions of $\mathcal{L}$, since clearly $\omega_j^{(n)} = \ell_j$ for all $j = 1, \ldots, M$. Therefore, within-fusion is a special case of across-fusion.

Figure 4.4: Example of Across-Fusion.
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4.5  Changing  Assumptions,  Robustness,  and  Example 

While  we  have  suggested  a  family  of  functionals  to  use  as  a  way  of  competing  clas¬ 
sification  systems  and  fusors,  this  family  is  not  the  only  choice  available.  Furthermore, 
one  may  desire  to  average  functionals  or  transform  them  into  new  functionals.  In  many 
ways,  the  functional  we  have  presented  is  general.  We  have  shown  its  relationship  to  the 
Bayes  optimal  and  Neyman-Pearson  points  on  a  ROC  curve.  It  can  also  be  shown  to  be 
related to Adams and Hand's development of a loss comparison functional. In [3], the loss comparison of a classification system (LC) is denoted by

\[
LC = \int I(c_1)\, L(c_1)\, dc_1 , \tag{4.73}
\]

where, although a slight abuse of notation, we have $I$ as an indicator function of whether or not the classification system is still minimal under cost $c_1$, and $c_1$ is the cost of one type of error while $c_0$ is the cost of the other. $L(c_1)$ is a belief function which linearly weights how far $c_1$ is from the believed true cost of the error (or ratio $c_1 / c_0$). This functional, LC, can be reformulated as follows:

Given competing classification systems $R = \{A_i\}_{i=1}^{k}$ for $k \in \mathbb{N}$ fixed, fix $\alpha = (\alpha_1, \alpha_2)$ and $\gamma = (\gamma_1, \gamma_2)$. Let $\Gamma$ be the set of all possible $\gamma$. Define a set $H_\gamma$ by

\[
H_\gamma = \{ A_j \in R \mid F_2(f_{A_j}; \gamma, \alpha) \ge F_2(f_{A_i}; \gamma, \alpha),\ \forall\, i \neq j,\ i = 1, 2, \ldots, k \}.
\]

Then, for $A_i$ we have that

\[
LC(A_i) = \int_{\Gamma} I_{H_\gamma}(A_i)\, W(\gamma)\, d\gamma \tag{4.74}
\]

where $W(\gamma)$ is the weight given to supposition $\gamma$ (a belief function in this case). Thus LC scores the classification families, and induces an ordering on $R$.

One more suggested use of $F_n$ would be to apply the belief function in a simpler way, and average $F_n$ over the believed true $\gamma$ and the believed extreme values of the set $\Gamma$, so that

\[
S_n(f_A) = \frac{1}{2n + 1} \left( \sum_{i=1}^{2n} F_n(f_A; \gamma_i, \alpha) + F_n(f_A; \gamma_0, \alpha) \right), \tag{4.75}
\]

where $\gamma_i$ are the believed extreme values of the set $\Gamma$, and $\gamma_0$ is the most believable (or probable under some instances) cost-prior product. In [3], the prior probabilities are assumed to be fixed, but they can be varied according to belief as well (although developing the belief functions will prove challenging).
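
The averaging idea can be sketched independently of how the functional itself is computed. In the sketch below, `score` stands for any callable that returns the ROC functional value of a fixed family for a given $\gamma$. The function name, the equal weighting of all terms, and the toy `score` used at the end are assumptions made for illustration.

```python
def averaged_functional(score, gamma_extremes, gamma_0):
    """Average a functional over extreme and most-believed gamma values.

    score: callable mapping a gamma tuple to a functional value.
    gamma_extremes: list of believed extreme gamma tuples.
    gamma_0: the most believable gamma tuple.
    """
    values = [score(g) for g in gamma_extremes]
    values.append(score(gamma_0))
    # Equal weights are an assumption of this sketch.
    return sum(values) / len(values)


# Toy stand-in for a ROC functional, for illustration only.
toy_score = lambda g: g[0] + g[1]
extremes = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (0.0, 0.0)]
average = averaged_functional(toy_score, extremes, (1.0, 1.0))
```

A belief-weighted version would replace the equal weights with values of a belief function such as $W(\gamma)$ in Equation 4.74.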

As  an  example,  consider  the  plot  of  two  competing  families  of  classification  systems 
in  Figure  4.5.  Since  we  collected  only  finite  data,  the  ROC  ‘curves’  are  actually  a  finite 
collection  of  ROC  points.  While  our  theory  develops  out  of  smooth  manifolds,  never¬ 
theless,  we  can  still  calculate  the  functionals  we  require,  since  they  operate  on  individual 
points  on  the  ROC  manifolds.  The  two  curves  in  question  cross  more  than  once,  and  this 
is  typical  of  many  ROC  curves,  so  deciding  which  family  of  classification  systems  is  best 
really  boils  down  to  which  classification  system  is  best.  Suppose  our  belief  of  the  situa¬ 
tion  we  are  trying  to  classify  is  that  the  ratio  of  prior  probabilities  ^jjry  is  with  with  a 
range  of  ratios  from  |  to  1.  Furthermore,  our  experts  believe  the  most  likely  cost  ratio  is 
=  1,  with  a  range  from  f  to  2.  Therefore,  our  prior-cost  ratio  is  most  likely  with 
a range from | to 2. We will refer to the two ROC curves as $f_{C_1}$ and $f_{C_2}$. Hence, the two classification systems shown in the figure yield scores of $F_2(f_{C_1}) = F_2(f_{C_2}) = 1.137$, indicating that the best classification systems in each family are equivalent with regard to the most believable prior-cost ratio. However, $S_2(f_{C_1}) = 0.336 > 0.330 = S_2(f_{C_2})$, indicating a preference of the best choice from $f_{C_1}$ once belief regarding the range of the prior-cost ratio is taken into account. If our beliefs are actual probabilities from recorded data, the results are even stronger for selecting $f_{C_1}$ as the best classification system.

There are, of course, other suggestions for performance functionals regarding competing fusion rules. Consider fusion rules as algorithms, divorcing them from the entire classification system. Mahler [33] recommends using mathematical information MoEs (measures of effectiveness) with respect to comparing performance of fusion algorithms (fusion rules). In particular, he refers to level 1 fusion MoEs as being traditionally 'localized' in their competence. His preferred approach is to use an information 'metric', the Kullback-Leibler Discrimination functional,

\[
K(f_c, f) = \int_X f_c(x) \log_2 \left( \frac{f_c(x)}{f(x)} \right) dx ,
\]

where $f_c$ is a probability distribution of perfect or near perfect ground truth, $f$ is a probability distribution associated with the fused output of the algorithm, and $X$ is the set of all possible measurements of the observation. This works fine, if such distributions are at hand. One drawback is that it measures the expected value of uncertainty and therefore its relationship to costs and prior probabilities is obscure (as was the case with the Neyman-Pearson criterion). The previous functionals we have forwarded for consideration operate on families of classification systems (in particular, ROC manifolds), not just systems which enjoy well-developed and tested probability distribution functions.

Figure 4.5: ROC Curves of Two Competing Classifier Systems.
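
For distributions on a finite measurement set the integral becomes a sum, which makes the Kullback-Leibler discrimination easy to compute when both distributions are available. The sketch below assumes the two distributions are given as equal-length lists of probabilities over the same outcomes. The function name and the handling of zero probabilities (terms with zero ground-truth mass are skipped, and zero fused mass under positive ground-truth mass gives infinity) are standard conventions adopted here as assumptions.

```python
import math


def kl_discrimination(f_truth, f_fused):
    """Discrete Kullback-Leibler discrimination in bits."""
    total = 0.0
    for p, q in zip(f_truth, f_fused):
        if p == 0.0:
            continue  # zero ground-truth mass contributes nothing
        if q == 0.0:
            return math.inf  # fused output rules out a possible outcome
        total += p * math.log2(p / q)
    return total


same = kl_discrimination([0.5, 0.5], [0.5, 0.5])
one_bit = kl_discrimination([1.0, 0.0], [0.5, 0.5])
```

The value is zero when the fused distribution matches the ground truth and grows as the two diverge, but as noted above it carries no direct information about costs or prior probabilities.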
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V.  Conclusions 

A fusion researcher should have a viable method of competing fusion rules. This is required to correctly define fusion, and to demonstrate improvements over existing methods. We have shown in this dissertation that every fusion system over a finite number of fundamental classification system branches can generate, under test conditions, a corresponding ROC manifold, and that, under a mild assumption of smoothness of the ROC manifold, a Bayes Optimal Threshold (BOT) can be found for each family of classification systems. Given
additional  assumptions  on  the  a  priori  probabilities  of  a  target  or  non-target,  along  with 
given  thresholds  for  the  conditional  class  probabilities,  a  functional  can  be  generated  for 
each  ROC  manifold.  Any  such  functional  will  generate  a  partial  ordering  on  families  of 
classification  systems,  on  categories  of  fusion  rules,  and  ultimately  on  categories  of  fusors, 
which  can  then  be  used  to  select  the  best  fusor  from  among  a  finite  collection  of  fusors. 
We demonstrate one such functional, the ROC functional, which is scalable to ROC manifolds of dimensions higher than 1, as well as to families of classification systems which do not generate ROC manifolds at all. The ROC functional, when populated with the appropriate data choices, will yield a value corresponding to the Bayes Optimal Threshold with respect to the classification system family being examined. Another data choice yields the Bayes Cost Threshold, and we have also shown that the Neyman-Pearson threshold of a classification system corresponds to the output of the ROC functional with another fixed data choice (so that it will correspond with the Bayes Optimal Threshold under one particular set of assumptions). Ultimately, a researcher could choose a cost-prior ratio (the one which seems most reasonable), perturb it, calculate the mean ROC functional value, and then choose the classification system with the greatest average ROC functional value. This
value would be a relative comparison of how robust that classification system is to changes (e.g., it would answer the question, “how much change is endured before another classification system is optimal?”) compared with other classification systems. The relationship of the ROC functional to other functionals, including the loss comparison functional, is demonstrated. Finally, there are other functionals to choose from. One which we mentioned, the Kullback-Leibler discrimination functional, may be unrelated to the ROC functional, yet may be suitable in particular circumstances where prior probabilities and costs are not fathomable, but probability distributions for fusion system algorithms and ground truth are available.
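The perturbation procedure described above can be sketched as follows for families of classification systems given only as finite sets of ROC points. Because the ROC functional itself is defined in an earlier chapter, this sketch substitutes the standard two-class Bayes risk as a stand-in objective (an assumption), so here the smaller average is preferred. The function names, the ROC points, and the choice to perturb only the false-negative cost are all hypothetical.

```python
def bayes_risk(point, p_target, c_fn, c_fp):
    """Expected cost at one operating point (p_fp, p_tp) of a two-class system."""
    p_fp, p_tp = point
    miss_cost = c_fn * p_target * (1.0 - p_tp)
    false_alarm_cost = c_fp * (1.0 - p_target) * p_fp
    return miss_cost + false_alarm_cost


def min_risk(points, p_target, c_fn, c_fp):
    """Return (smallest risk, operating point attaining it) over a finite family."""
    return min((bayes_risk(pt, p_target, c_fn, c_fp), pt) for pt in points)


def mean_min_risk(points, p_target, c_fn, c_fp,
                  rel_steps=(-0.2, -0.1, 0.0, 0.1, 0.2)):
    """Average the minimum risk while the false-negative cost is perturbed.

    Perturbing c_fn with c_fp fixed changes the cost ratio, which stands in
    for perturbing the cost-prior ratio.
    """
    values = [min_risk(points, p_target, c_fn * (1.0 + s), c_fp)[0]
              for s in rel_steps]
    return sum(values) / len(values)


# Hypothetical ROC points (p_fp, p_tp) for two competing families.
family_a = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (1.0, 1.0)]
family_b = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.9), (1.0, 1.0)]

for name, family in (("A", family_a), ("B", family_b)):
    risk, point = min_risk(family, p_target=0.5, c_fn=1.0, c_fp=1.0)
    average = mean_min_risk(family, p_target=0.5, c_fn=1.0, c_fp=1.0)
    print(name, point, round(risk, 4), round(average, 4))
```

No smoothness or convexity of the ROC points is required: the minimum is taken directly over the finite set, which is the situation in which a closed form of the ROC manifold is unavailable.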

5.1  Significant  Contributions 

We  believe  that  significant  contributions  have  been  made  in  this  dissertation  to  the 
body  of  knowledge  referred  to  as  data  or  information  fusion.  The  contributions  of  new 
and  extended  applied  mathematics  were  made  in  the  following  presentations: 

•  Rigorous  Mathematical  descriptions  of: 

1.  classification  systems; 

2.  ROC  curves,  manifolds,  spaces. 

•  Extended  and  corrected  Alsing’s  ROC  convergence  theorem  [4].  Convergence  is 
shown  to  occur  almost  surely  as  countably  infinite  random  samples  are  taken  from 
test  sample  spaces,  the  sets  of  which  are  nested  and  converging  to  a  true  set  0. 
The  data  does  not  need  to  be  balanced  between  the  classes  as  assumed  by  Alsing. 
We  relied  upon  the  writings  of  Doob  [9],  Billingsley  [5],  and  Kolmogorov  [28] 
to  carefully  follow  the  subtle  differences  between  actual  experimental  data  and  its 
connection  to  the  theory  of  probability. 

•  Developed a ROC functional, Fn, which is scalable and can be used without the restrictions of continuity, differentiability, convexity, etc., which were necessary to the theory of finding the optimal points on ROC manifolds.

1.  Demonstrated  its  relation  to  Bayes  Optimal  thresholds  and  Neyman-Pearson 
thresholds. 

2.  Constructed  a  more  robust  functional  from  the  ROC  functional  which  may  be 
even  more  useful  than  the  ROC  functional. 

•  Proved the Min-Max functional is a minimum two-norm problem, and can be used without the restrictions of continuity, differentiability, convexity, etc., which were necessary to the theory of finding the optimal points on ROC manifolds.

•  Demonstrated  the  pitfalls  associated  with  comparing  fusors  with  fixed  branches 
when  doing  across  fusion,  since  the  label  partitions  would  be  different  for  each 
classifier family.

•  Developed  a  calculus  of  variations  solution  to  finding  optimal  elements  of  ROC 
manifolds  under  prior  probability  and  cost  constraints  for  finite  classes.  This  is  an 
extension  of  known  optimizations  with  two-class  problems,  which  used  differential 
calculus,  and  is  a  novel  approach  which  led  to  discovering  a  functional  that  works 
without  the  constraints  of  classifier  system  families  having  certain  well-behaved 
properties. 

•  Developed the Algebra/Category Theory of the fusion of classification systems, including how functors, such as the ROC functional and minimum norm functional, are natural transformations from the categories of fusion rules and fusors to a partially ordered set. Partial orders arise naturally with an objective function, thereby allowing definitions of fusors to be constructed, as well as defining categories of fusion rules and fusors. This description of data fusion meets the desires of the data fusion community as cited in [54].
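For reference, the known two-class optimization mentioned in the calculus of variations item above is commonly written as follows. This is a sketch under the assumption that the ROC curve is the graph of a differentiable function $g$ giving the true-positive probability as a function of the false-positive probability; the symbols $c_{fp}$, $c_{fn}$, $P(T)$, and $P(N)$ (costs and prior probabilities of target and non-target) are our own notation for this illustration and may differ from that used in earlier chapters.

```latex
R(P_{fp}) = c_{fp}\,P(N)\,P_{fp} + c_{fn}\,P(T)\,\bigl(1 - g(P_{fp})\bigr),
\qquad
\frac{dR}{dP_{fp}} = 0
\;\Longrightarrow\;
g'\!\left(P_{fp}^{*}\right) = \frac{c_{fp}\,P(N)}{c_{fn}\,P(T)} .
```

That is, an interior minimum of the risk lies where the slope of the ROC curve equals the cost-prior ratio. This condition requires differentiability of the curve, which is the kind of restriction that a functional evaluated directly on a family of classification systems avoids.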

5.1.1  Recommendations for Follow-on Research.  The work described in this dissertation should be supplemented with the following ideas for future research:

•  Find  universals  in  the  category  of  fusors.  We  suspect  that  the  truth  fusor  and  false 
fusor  along  with  the  arrow  induced  by  the  partial  order  may  be  universal  in  some 
way; 

•  Allow the categories to propagate arrows, such as an arrow representing time in the event-state category. In this way, stochastic processes may be modeled and explained better;

•  There  is  a  need  to  define  what  situation/threat  refinements  are  in  order  to  apply  this 
fusion  process  foundation  to  the  elevated  levels  of  data  fusion,  as  described  in  the 
JDL  functional  model; 

•  Find the common theory behind functionals, such as the ROC functional, and the information measures of effectiveness, such as the Kullback-Leibler cross entropy. Also needed is the full relationship between the ROC functional and the AUC;

•  The robustness of the classification systems which are minimal with respect to an objective function needs to be explored further, as does the possibility that costs are not fixed constants but rather functions of the error axes themselves. What, then, is the minimizing argument? Is there a way to find this point on the ROC manifold?

•  Develop  and  seek  out  applications  for  which  our  theory  explains  and  describes  the 
process.  Our  desire  is  to  build  up  the  examples  in  order  to  make  the  explanations 
more  useful  and  relevant  to  those  not  versed  in  category  theory,  but  for  whom  this 
research  would  be  beneficial. 

This  short  list  is  not  comprehensive,  but  gives  a  few  good  topics  both  within  category 
theory  and  linear  operator  theory  to  expand  the  state  of  our  current  knowledge. 